TOPICS IN HIGH-DIMENSIONAL REGRESSION AND NONPARAMETRIC MAXIMUM LIKELIHOOD
METHODS
BY LONG FENG
A dissertation submitted to the
Graduate School—New Brunswick
Rutgers, The State University of New Jersey
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
Graduate Program in Statistics and Biostatistics
Written under the direction of
Cun-Hui Zhang & Lee H. Dicker
and approved by
New Brunswick, New Jersey
October, 2017
ABSTRACT OF THE DISSERTATION
Topics in high-dimensional regression and nonparametric
maximum likelihood methods
by LONG FENG
Dissertation Director: Cun-Hui Zhang & Lee H. Dicker
This thesis contains two parts. The first part, in Chapters 2-4, addresses three connected issues in penalized least-square estimation for high-dimensional data. The second part, in Chapter 5, concerns nonparametric maximum likelihood methods for mixture models.
In the first part, we prove estimation, prediction and selection properties of concave penalized least-square estimation (PLSE) under fully observed and noisy/missing designs, and validate an essential condition for PLSE: the restricted eigenvalue condition. In Chapter 2, we prove that the concave PLSE matches the oracle inequalities for prediction and coefficient estimation of the Lasso, based only on the restricted eigenvalue condition, one of the mildest conditions imposed on the design matrix. Furthermore, under a uniform signal strength assumption, selection consistency does not require any additional conditions for proper concave penalties such as the SCAD penalty and the MCP. A scaled version of the concave PLSE is also proposed to jointly estimate the regression coefficients and the noise level. Chapter 3 concerns high-dimensional regression when the design matrix is subject to missingness or noise. We extend the PLSE for fully observed designs to noisy or missing designs and prove that the same order of coefficient estimation error can be obtained, while requiring no additional condition. Moreover, we show that a linear combination of the $\ell_2$ norm of the regression coefficients and the noise level suffices as a penalty level when noise or missingness is present. This sharpens the commonly understood results, in which the $\ell_1$ norm of the coefficients is required. Chapter 4 validates the restricted eigenvalue (RE) type conditions required in Chapters 2 and 3 and considers a more general groupwise version. We prove that the population version of the groupwise RE condition implies its sample version under a low-moment condition, given the usual sample-size requirement. Our results include the ordinary RE condition as a special case.
In the second part, we consider nonparametric maximum likelihood (NPML) methods
for mixture models, a nonparametric empirical Bayes approach. We provide concrete
guidance on implementing multivariate NPML methods for mixture models, with theoretical
and empirical support; topics covered include identifying the support set of the mixing
distribution, and comparing algorithms (across a variety of metrics) for solving the
simple convex optimization problem at the core of the approximate NPML problem. In
addition, three diverse real data applications are provided to illustrate the performance of
nonparametric maximum likelihood methods.
Acknowledgements
I would like to express my deepest gratitude to my advisors, Prof. Cun-Hui Zhang and
Prof. Lee Dicker. I feel extremely fortunate to have the opportunity to work with them.
Prof. Zhang is more than an advisor to me. He is a great researcher, a dedicated educator, a devoted mentor, a trusted friend and a respectable elder. He has provided me with helpful instruction and exceptional research training, as well as unwavering support and constant encouragement. More importantly, his devotion to research and to students has made me interested in exploring a career in academia and in becoming a researcher and teacher. Prof. Dicker is the junior professor I admire most. His brilliant ideas and excellent statistical intuition always deepen my understanding of research questions. He can always convey complex problems in plain language. I enjoy working with him very much and benefit greatly from each of our meetings.
Secondly, I would like to extend my gratitude to my dissertation committee, Prof. Pierre Bellec and Prof. Eitan Greenshtein, for the time they dedicated to reviewing my thesis and for their comments on the manuscript. Special thanks go to Prof. Greenshtein for his helpful discussions on the topics of nonparametric maximum likelihood estimation in Chapter 5 of this thesis.
In addition, I want to thank Professor John Kolassa for his support over the past five years, and Prof. Minge Xie for his advice and encouragement during my study. I also want to thank the fellow students in our department and my friends at Rutgers for their suggestions and help. I feel very lucky to have met all these people in my graduate life and to have had such a happy and unforgettable journey.
Dedication
To my family
Table of Contents
Abstract
Acknowledgements
Dedication
List of Tables
List of Figures
1. Introduction
1.1. High-dimensional regression
1.2. Nonparametric maximum likelihood methods
2. Oracle properties of concave PLSE and its scaled version
2.1. Introduction
2.2. Statistical Properties of Concave PLSE methods
2.2.1. Concave penalties
2.2.2. The restricted eigenvalue condition
2.2.3. Properties of concave PLSE
2.3. Smaller penalty levels
2.3.1. Smaller penalty levels
2.3.2. RE-type conditions for smaller penalty levels
2.3.3. Prediction and estimation error bounds at smaller penalty levels
2.4. Scaled concave PLSE
2.4.1. Description of the scaled concave PLSE
2.4.2. Performance guarantees of scaled concave PLSE at universal penalty levels
2.4.3. Performance bounds of scaled concave PLSE at smaller penalty levels
2.5. Simulation Study
2.5.1. No signal case: $\beta^* = 0$
2.5.2. Effect of correlation: ranging over different $\rho$
2.5.3. Effect of signal-to-noise ratio: ranging over different snr
2.5.4. Effect of sparsity: ranging over different $\alpha$
2.6. Discussion
3. Penalized least-square estimation with noisy and missing data
3.1. Introduction
3.2. Theoretical Analysis of PLSE
3.2.1. Restricted eigenvalue conditions
3.2.2. Main results
3.3. Theoretical penalty levels for missing/noisy data
3.4. Scaled PLSE and Variance Estimation
3.5. Conclusions
4. Group Lasso under Low-Moment Conditions on Random Designs
4.1. Introduction
4.2. A review of restricted eigenvalue type conditions
4.3. The group transfer principle
4.4. Groupwise compatibility condition
4.5. Groupwise restricted eigenvalue condition
4.6. Convergence of the restricted eigenvalue
4.7. Lemmas
4.8. Discussion
5. Nonparametric Maximum Likelihood for Mixture Models: A Convex Optimization Approach to Fitting Arbitrary Multivariate Mixing Distributions
5.1. Introduction
5.2. NPMLEs for mixture models via convex optimization
5.2.1. NPMLEs
5.2.2. A simple finite-dimensional convex approximation
5.3. Choosing $\Lambda$
5.4. Connections with finite mixtures
5.5. Implementation overview
5.6. Simulation studies
5.6.1. Comparing NPMLE algorithms
5.6.2. Gaussian location scale mixtures: Other methods for estimating a normal mean vector
5.7. Baseball data
5.8. Two-dimensional NPMLE for cancer microarray classification
5.9. Continuous glucose monitoring
5.9.1. Linear model
5.9.2. Kalman filter
5.9.3. Comments on results
5.10. Discussion
List of Tables
2.1. Median bias of standard deviation estimates. No signal, $\sigma = 1$, $\rho = 0$, sample size $n = 100$. Minimum error besides the oracle is in bold for each analysis.
5.1. Comparison of different NPMLE algorithms. Mean values (standard deviation in parentheses) reported from 100 independent datasets; $p = 1000$ throughout the simulations. Mixing distribution 1 has constant $\sigma_j$; mixing distribution 2 has correlated $\mu_j$ and $\sigma_j$.
5.2. Mean TSE for various estimators of $\mu \in \mathbb{R}^p$ based on 100 simulated datasets; $p = 1000$. $(q_1, q_2)$ indicates the grid points used to fit $G_\Lambda$.
5.3. Baseball data. TSE relative to the naive estimator. Minimum error is in bold for each analysis.
5.4. Microarray data. Number of misclassification errors on test data.
5.5. Blood glucose data. MSE relative to CGM.
List of Figures
2.1. Median standard deviation estimates over different levels of predictor correlation. $\sigma = 1$, $\alpha = 0.5$, snr $= 1$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).
2.2. Median standard deviation estimates over different levels of signal-to-noise ratio. $\sigma = 1$, $\alpha = 0.5$, $\rho = 0$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).
2.3. Median standard deviation estimates over different levels of sparsity. $\sigma = 1$, snr $= 1$, $\rho = 0$, sample size $n = 100$, predictors $p = 100, 200, 500, 1000$ moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).
2.4. Five $\lambda_0$'s as functions of $k$, $n = 100$, $p = 1000$. Line numbers refer to (1) $\lambda_0(k) = \{(2/n)\log(p)\}^{1/2}$, (2) $\lambda_0(k) = \{(2/n)\log(p/k)\}^{1/2}$, (3) $\lambda_0(k) = (2/n)^{1/2}L_1(k/p)$, (4) the adaptive $\lambda_0$ described in Section 2.5 with various $k$, assuming that the correlation between columns of $X$ is 0, (5) same as (4) except that the correlation between columns of $X$ is 0.8. Here $k_1$ is the solution to (2.37) and $k_2$ is the solution to $2k = L_1^4(k/p) + 2L_1^2(k/p)$.
5.1. (a) Histogram of 20,000 independent draws from the estimated distribution of $(A_j, H_j/A_j)$, fitted with the Poisson-binomial NPMLE to all players in the baseball dataset; (b) histogram of non-pitcher data from the baseball dataset; (c) histogram of pitcher data from the baseball dataset.
Chapter 1
Introduction
1.1 High-dimensional regression
The first part of this thesis addresses three issues in parameter estimation, prediction and
variable selection for high-dimensional regression: concave penalized least-square regression,
high-dimensional regression with noisy and missing data, and restricted eigenvalue-type
conditions for high-dimensional regression.
As modern technology generates data on an unprecedented scale, high-dimensional data have been studied intensively in both statistics and computer science. In linear regression, a widely used approach to analyzing high-dimensional data is penalized least-square estimation (PLSE). The Lasso, or $\ell_1$ penalization [71], and concave penalization, such as the SCAD [21] and the MCP [84], are the two mainstream methods in penalized least-square estimation. It has been shown that concave PLSE guarantees variable selection consistency under significantly weaker conditions than the Lasso; for example, the strong irrepresentable condition on the design matrix required by the Lasso can be replaced by a sparse Riesz condition. Moreover, concave PLSE also enjoys rate-optimal error bounds in prediction and coefficient estimation. However, the error bounds for prediction and coefficient estimation in the literature still require significantly stronger conditions than what the Lasso requires, for example knowledge of the $\ell_1$ norm of the true coefficient vector or an upper sparse eigenvalue condition. Ideally, selection, prediction and estimation properties should depend only on a lower sparse eigenvalue or restricted eigenvalue condition; is that achievable? In Chapter 2, we give an affirmative answer to this question.
In Chapter 2, we prove that the concave PLSE matches the oracle inequalities for prediction and $\ell_q$ coefficient estimation of the Lasso, with $1 \le q \le 2$, based only on the restricted eigenvalue condition, which can be viewed as nearly the weakest available condition on the design matrix. Furthermore, under a uniform signal strength assumption, selection consistency does not require any additional conditions for proper concave penalties such as the SCAD penalty and the MCP. Our theorem applies to all the local solutions computable by path-following algorithms starting from the origin. We also develop a scaled version of the concave PLSE that jointly estimates the regression coefficients and the noise level. The scaled concave PLSE is not an easy extension of the scaled Lasso because the joint optimization over the regression coefficients and the noise level is non-convex for concave penalties. The computational cost of the scaled concave PLSE is negligible beyond computing a continuous solution path. All our consistency results apply to cases where the number of predictors $p$ is much larger than the sample size $n$.
In Chapter 3, we consider high-dimensional regression when the design matrices are not fully observable. Two specifications are discussed: missing design and noisy design. We extend the PLSE from fully observed designs to noisy or missing designs and prove that the same order of coefficient estimation error can be obtained as with a fully observed design, while requiring no additional condition. Moreover, we prove that a linear combination of the noise level and the $\ell_2$ norm of the coefficients is a sufficient penalty level when noise or missing data are present. This sharpens the existing results, in which an $\ell_1$ norm of the coefficients is required. We further extend the scaled version of the PLSE to the missing and noisy data case. Since cross-validation is time consuming and may be misleading for missing or noisy data, the proposed scaled solution is of great use.
As discussed before, restricted eigenvalue (RE) type conditions can be viewed as nearly the weakest available conditions on the design matrix to guarantee the prediction and estimation performance of the Lasso, concave penalized least-square estimators and groupwise estimators in high-dimensional regression. In Chapter 4, we prove that the population version of the groupwise RE condition implies its sample version under (i) a second moment uniform integrability assumption on the linear combinations of the design variables and (ii) a fourth moment uniform boundedness assumption on the individual design variables and an $m$-th moment assumption on the linear combinations of the within-group design variables for $m > 2$, given the usual sample-size requirement. Moreover, the fourth and $m$-th moment assumptions can be removed given a slightly larger sample size. The low-moment condition is also sufficient to guarantee the groupwise compatibility condition, an $\ell_1$-version of the RE condition. Our results include the ordinary RE condition as a special case. This study demonstrates a benefit of standardizing the design variables in penalized least-squares estimation for heavy-tailed random designs. In addition, it indicates that the RE condition for a bootstrapped sample can be guaranteed given the corresponding sample RE condition.
1.2 Nonparametric maximum likelihood methods
The second part of this thesis considers two types of models using nonparametric maximum
likelihood (NPML) methods, a nonparametric empirical Bayes approach: NPML methods
for mixture models and NPML methods for linear models.
Nonparametric maximum likelihood (NPML) for mixture models is a technique for
estimating mixing distributions that has a long and rich history in statistics going back to the
1950s, and is closely related to empirical Bayes methods. Historically, NPML-based methods
have been considered to be relatively impractical because of computational and theoretical
obstacles. However, recent work focusing on approximate NPML methods suggests that
these methods may have great promise for a variety of modern applications. Building on this
recent work, we study a class of flexible, scalable, and easy to implement approximate NPML
methods for problems with multivariate mixing distributions. In Chapter 5, we provide
concrete guidance on implementing these methods, with theoretical and empirical support;
topics covered include identifying the support set of the mixing distribution, and comparing
algorithms (across a variety of metrics) for solving the simple convex optimization problem
at the core of the approximate NPML problem. Additionally, we illustrate the methods’
performance in three diverse real data applications: (i) A baseball data analysis (a classical
example for empirical Bayes methods, originally inspired by Efron & Morris), (ii) high-
dimensional microarray classification, and (iii) online prediction of blood-glucose density
for diabetes patients. Among other things, our empirical results clearly demonstrate the
relative effectiveness of using multivariate (as opposed to univariate) mixing distributions
for NPML-based approaches.
Chapter 2
Oracle properties of concave PLSE and its scaled version
2.1 Introduction
The purpose of this chapter is to study the prediction, coefficient estimation, and variable selection properties of the concave penalized least-squares estimator (PLSE) in linear regression under the restricted eigenvalue (RE) condition on the design matrix.
Consider the linear model
$$y = X\beta^* + \varepsilon, \qquad (2.1)$$
where $X = (x_1,\ldots,x_p) \in \mathbb{R}^{n\times p}$ is a design matrix, $y \in \mathbb{R}^n$ is a response vector, $\varepsilon \in \mathbb{R}^n$ is a noise vector, and $\beta^* \in \mathbb{R}^p$ is an unknown coefficient vector. For simplicity, we assume throughout the chapter that the design matrix is column normalized with $\|x_j\|_2^2 = n$.
We shall focus on penalized loss functions of the form
$$L(\beta;\lambda) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^{p} \rho(|\beta_j|;\lambda), \qquad (2.2)$$
where the penalty function $\rho(t;\lambda)$, indexed by $\lambda \ge 0$, is concave in $t > 0$ with $\rho(0+;\lambda) = \rho(0;\lambda) = 0$, and the index $\lambda$ is taken as the penalty level $\lim_{t\to 0+}\rho(t;\lambda)/t$. Additional regularity conditions on $\rho(\cdot;\cdot)$ will be described in Section 2.2. The PLSE can be defined as a statistical choice among the local minimizers of the penalized loss.
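As a small numerical illustration (not part of the original text), the penalized loss (2.2) is straightforward to evaluate for any user-supplied penalty; the following Python sketch uses hypothetical helper names.

```python
import numpy as np

def penalized_loss(beta, X, y, penalty, lam):
    """Evaluate L(beta; lambda) = ||y - X beta||_2^2 / (2n) + sum_j rho(|beta_j|; lambda).

    `penalty` is any function rho(t, lam) defined for t >= 0 (Lasso, SCAD, MCP, ...).
    """
    n = X.shape[0]
    fit = np.sum((y - X @ beta) ** 2) / (2 * n)
    pen = np.sum([penalty(abs(b), lam) for b in beta])
    return fit + pen

# example with the Lasso penalty rho(t; lambda) = lambda * t
lasso_penalty = lambda t, lam: lam * t
```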
Among PLSE methods, the Lasso [71] with the absolute penalty $\rho(t;\lambda) = \lambda|t|$ is the most widely used and extensively studied. The Lasso is relatively easy to compute as it is a convex minimization problem, but it is well known that the Lasso is biased. A consequence of this bias is the requirement of a neighborhood stability/strong irrepresentable condition
on the design matrix X for the selection consistency of the Lasso [51, 88, 72, 79]. Fan and
Li [21] proposed a concave penalty to remove the bias of the Lasso and proved an oracle
property for one of the local minimizers of the resulting penalized loss. Zhang [84] proposed
a path finding algorithm PLUS for concave PLSE and proved the selection consistency
of the PLUS-computed local minimizer under a rate optimal signal strength condition on
the coefficients and the sparse Riesz condition (SRC) [85] on the design. The SRC, which
requires bounds on both the lower and upper sparse eigenvalues of the Gram matrix and is
closely related to the restricted isometry property (RIP) [12], is substantially weaker than
the strong irrepresentable condition. This advantage of concave PLSE over the Lasso has
since become well understood.
For prediction and coefficient estimation, the existing literature somehow presents an opposite story. Consider hard sparse coefficient vectors satisfying $|\mathrm{supp}(\beta^*)| \le s$ with small $(s/n)\log p$. Although rate minimax error bounds were proved under the RIP and SRC respectively for the Dantzig selector and the Lasso in [11] and [85], Bickel et al. [6] sharpened their results by weakening the RIP and SRC to the RE condition, and van de Geer and Bühlmann [77] proved comparable prediction and $\ell_1$ estimation error bounds under an even weaker compatibility or $\ell_1$ RE condition. Meanwhile, rate minimax error bounds for concave PLSE still require two-sided sparse eigenvalue conditions like the SRC [84, 87, 80, 22] or a proper known upper bound for the $\ell_1$ norm of the true coefficient vector [46]. It turns out that the difference between the SRC and RE conditions is quite significant, as Rudelson and Zhou [66] proved that the RE condition is a consequence of a lower sparse eigenvalue condition alone. This seems to suggest a theoretical advantage of the Lasso, in addition to its computational simplicity, compared with concave PLSE.
An interesting question is whether the RE condition alone on the design matrix is also sufficient for the above discussed results on concave penalized prediction, coefficient estimation and variable selection, provided proper conditions on the coefficient and noise vectors. An affirmative answer to this question, which we provide in this chapter, amounts to the removal of the upper sparse eigenvalue condition on the design matrix, and in fact also a relaxation of the lower sparse eigenvalue condition or the restricted strong convexity (RSC) condition [56] imposed in [46]. We also extend the prediction and estimation error bounds to smaller penalty levels $\lambda$, which are more practical and provide rate minimaxity in prediction and coefficient estimation when $(s/n)\log(p/s)$ is small.
The Lasso still enjoys computational advantages over concave PLSE. However, this
advantage may not be so drastic in many applications in view of the literature on
statistical and computational properties of iterative and path finding algorithms for concave
penalization [25, 89, 84, 87, 9, 1, 32, 56, 80, 46, 22]. In this chapter, we focus on statistical
properties of local solutions of concave PLSE computable by path finding algorithms as we
are also interested in adaptive choice of the penalty level λ in the solution path and the
estimation of the noise level. Exact solution paths of the PLSE can be computed by the
PLUS algorithm [84], while approximate solution paths can be computed by the gradient
descent algorithm of Wang et al. [80] with a computational complexity guarantee.
Once a local solution path of the concave penalization problem is obtained, one still needs to make an appropriate choice of an estimator in the solution path, or of a proper penalty level. This problem, which we also study in this chapter, is equivalent to consistent estimation of the noise level due to scale invariance.
Substantial effort has been made in scale-free estimation under the $\ell_1$ penalty. The idea is to make the penalty level proportional to the noise level $\sigma$. Stadler et al. [67] proposed to estimate $\beta$ and $\sigma$ by maximizing their joint log-likelihood with an $\ell_1$ penalty on $\beta/\sigma$ through reparametrization. In the discussion of [67], Antoniadis [2] proposed to minimize Huber's [34] concomitant joint loss function with the $\ell_1$ penalty on $\beta$ without reparametrization, and Sun and Zhang [68] considered a "naive" iteration between the estimation of $\beta$ and $\sigma$ and proved the bias reduction property of one iteration from the joint estimator of [67]. Belloni et al. [5] introduced and studied a square-root Lasso for the estimation of $\beta$. It turns out that for the $\ell_1$ penalty, Huber's concomitant joint loss, the equilibrium of the iterative algorithm, and the square-root Lasso all produce the same estimator. Sun and Zhang [69] proposed the iterative algorithm as a scaled PLSE for joint estimation of $\beta$ and $\sigma$ under both the $\ell_1$ and concave penalties, and studied the scaled Lasso with the joint penalized loss of [2], especially the consistency and asymptotic normality of the resulting noise level estimator. However, a theoretical study of the scaled concave PLSE is noticeably missing.
A main reason for this absence of a theoretical study of scale-free concave PLSE is the loss of the scale-free property: in the joint likelihood, the concomitant loss and the square-root formulations, it is not proper to use scale-free concave penalty functions as they are not proportional to the penalty level. While the iterative approach is still scale free with concave penalties, concave regularization is more difficult to study due to the loss of its equivalence to joint convex minimization, compared with the Lasso.
In this chapter, we find a much weaker condition under which local solutions of concave PLSE enjoy the desired properties in prediction, coefficient estimation, and variable selection as well. Specifically, we prove that the concave PLSE achieves rate minimaxity in prediction and coefficient estimation under the $\ell_0$ sparsity condition on $\beta$ and the RE condition on $X$. Furthermore, selection consistency can also be guaranteed under an additional uniform signal strength condition on the nonzero coefficients. In addition, we prove that the same properties hold for the scaled concave PLSE in the iterative algorithm formulation.
The rest of this chapter is organized as follows. In Section 2.2, we study concave PLSE
under the RE condition on the design. In Section 2.3 we study concave PLSE with smaller
penalty/threshold levels. In Section 2.4 we study theoretical properties of the scaled concave
PLSE. Section 2.5 presents results of an extensive simulation study for variance estimation.
Section 2.6 contains some discussion.
Notation: We denote by $\beta^*$ the true regression coefficient vector, $\Sigma = X^TX/n$ the sample Gram matrix, $S = \mathrm{supp}(\beta^*)$ the support set of the coefficient vector, $s = |S|$ the size of the support, and $\Phi(\cdot)$ the standard Gaussian cumulative distribution function. For vectors $v = (v_1,\ldots,v_p)$, we denote by $\|v\|_q = \{\sum_j |v_j|^q\}^{1/q}$ the $\ell_q$ norm, with $\|v\|_\infty = \max_j |v_j|$ and $\|v\|_0 = \#\{j : v_j \ne 0\}$. Moreover, $x_+ = \max(x, 0)$.
2.2 Statistical Properties of Concave PLSE methods
In this section, we present our results for concave PLSE at a sufficiently high penalty level
to allow selection consistency. We first need to describe our assumptions on the penalty
function and design matrix.
2.2.1 Concave penalties
We study the class of concave penalties ρpt;λq satisfying the following properties:
(i) ρpt;λq is symmetric, ρpt;λq “ ρp´t;λq;
(ii) ρpt;λq is monotone, ρpt1;λq ď ρpt2;λq for all 0 ď t1 ă t2;
(iii) ρpt;λq is left- and right-differentiable in t for all t;
(iv) ρpt;λq has selection property, 9ρp0`;λq “ λ;
(v) | 9ρpt´;λq| _ | 9ρpt`;λq| ď λ for all real t.
We write $\dot\rho(t;\lambda) = x$ when $x$ is between the left- and right-derivative of $\rho(t;\lambda)$ at $t$, including $t = 0$, where $\dot\rho(0;\lambda) = x$ means $|x| \le \lambda$. We use the following quantities to measure the concavity of penalty functions. For a given penalty function $\rho(\cdot;\lambda)$, define the maximum concavity at $t$ as
$$\kappa(t;\rho,\lambda) = \sup_{t_1 > 0}\frac{\dot\rho(t_1;\lambda) - \dot\rho(t;\lambda)}{t - t_1}, \qquad (2.3)$$
where the supremum is taken over all possible choices of $\dot\rho(t;\lambda)$ and $\dot\rho(t_1;\lambda)$ between the left- and right-derivatives. Further, define the overall maximum concavity of $\rho(\cdot;\lambda)$ as
$$\kappa(\rho) = \kappa(\rho,\lambda) = \max_{t\ge 0}\kappa(t;\rho,\lambda). \qquad (2.4)$$
Many popular penalties satisfy conditions (i) to (v). We illustrate the SCAD (smoothly clipped absolute deviation) penalty and the MCP (minimax concave penalty) as examples. The SCAD penalty [21] is defined as
$$\rho(t,\lambda) = \lambda\int_0^{|t|}\left\{I(x\le\lambda) + \frac{(\gamma\lambda - x)_+}{(\gamma-1)\lambda}I(x>\lambda)\right\}dx \qquad (2.5)$$
with a fixed parameter $\gamma > 2$. A straightforward calculation yields $\kappa(0;\rho,\lambda) = 1/\gamma$ and $\kappa(\rho,\lambda) = 1/(\gamma-1)$ for the SCAD penalty. The MCP [84] is defined as
$$\rho(t,\lambda) = \lambda\int_0^{|t|}\Big(1 - \frac{x}{\lambda\gamma}\Big)_+dx \qquad (2.6)$$
with $\gamma > 0$ and $\kappa(\rho,\lambda) = \kappa(0;\rho,\lambda) = 1/\gamma$.
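For reference, the closed forms implied by (2.5) and (2.6) and their derivatives can be coded directly; the following Python sketch is illustrative only (the default values of $\gamma$ are arbitrary choices, not prescriptions from the text).

```python
import numpy as np

def scad(t, lam, gamma=3.7):
    """SCAD penalty rho(t, lambda) of (2.5) in closed form, for gamma > 2."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= gamma * lam,
                 (gamma * lam * t - 0.5 * (t ** 2 + lam ** 2)) / (gamma - 1),
                 0.5 * lam ** 2 * (gamma + 1)))

def scad_dot(t, lam, gamma=3.7):
    """Derivative of the SCAD penalty; its maximum concavity is 1/(gamma - 1)."""
    t = np.abs(t)
    return np.where(t <= lam, lam, np.maximum(gamma * lam - t, 0.0) / (gamma - 1))

def mcp(t, lam, gamma=2.0):
    """MCP rho(t, lambda) of (2.6) in closed form, for gamma > 0."""
    t = np.abs(t)
    return np.where(t <= gamma * lam, lam * t - t ** 2 / (2 * gamma), 0.5 * gamma * lam ** 2)

def mcp_dot(t, lam, gamma=2.0):
    """Derivative of the MCP; its maximum concavity is 1/gamma."""
    t = np.abs(t)
    return np.maximum(lam - t / gamma, 0.0)
```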
2.2.2 The restricted eigenvalue condition
We now consider conditions on the design matrix. The restricted eigenvalue (RE) condition, proposed in [6], can be viewed as nearly the weakest available condition on the design to guarantee rate-optimal prediction and coefficient estimation performance of the Lasso. The RE coefficient $\mathrm{RE}_2(S;\eta,\delta^*)$ for the $\ell_2$ estimation loss can be defined as follows: for $\eta \in [0,1)$ and $\delta^* \in [0,1]$,
$$\mathrm{RE}_2^2(S;\eta,\delta^*) = \inf\left\{\frac{u^T\Sigma u}{\|u\|_2^2} : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta^*\eta)\|u_S\|_1\right\}. \qquad (2.7)$$
The RE condition refers to the property that $\mathrm{RE}_2(S;\eta,\delta^*)$ is no smaller than a certain positive constant for all $n$ and $p$. For prediction and $\ell_1$ estimation, an $\ell_1$-version of the RE can be employed. The following compatibility or $\ell_1$-RE coefficient [77] can be used:
$$\mathrm{RE}_1^2(S;\eta,\delta^*) = \inf\left\{\frac{u^T\Sigma u\,|S|}{\|u_S\|_1^2} : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta^*\eta)\|u_S\|_1\right\}. \qquad (2.8)$$
We introduce a relaxed cone invertibility factor (RCIF) for prediction as
$$\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega) = \inf\left\{\frac{\|\Sigma u\|_\infty^2\,|S|}{u^T\Sigma u} : (1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S\right\}, \qquad (2.9)$$
where $\omega\in\mathbb{R}^p$, and an RCIF for $\ell_q$ estimation, $1\le q\le 2$, as
$$\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega) = \inf\left\{\frac{\|\Sigma u\|_\infty\,|S|^{1/q}}{\|u\|_q} : (1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S\right\}. \qquad (2.10)$$
The choices of $\delta^*$ and $\omega$ depend on the problem under consideration in the analysis, but typically we have $\|\omega\|_\infty \le 1+\delta^*\eta$, so that the minimization in (2.9) and (2.10) is taken over a smaller cone. For example, one may take $\omega_S = 0$ for studying selection consistency. We will use an RE condition to prove cone membership of the estimation error of the concave PLSE and the RCIF to bound the prediction and coefficient estimation errors. The following proposition shows that the RCIF may provide sharper bounds than the RE does.
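The RE and compatibility coefficients in (2.7) and (2.8) are infima over a cone and are not computable exactly in general. As a rough illustration (not part of the thesis), the following Monte-Carlo sketch samples directions satisfying the cone constraint and returns the smallest Rayleigh-type ratio found; this is only an upper bound on the true value of $\mathrm{RE}_2^2(S;\eta,\delta^*)$.

```python
import numpy as np

def re2_upper_bound(X, S, eta, delta=1.0, n_draws=20000, seed=0):
    """Crude Monte-Carlo estimate of RE_2^2(S; eta, delta) in (2.7).

    Samples u with u_S dense and u_{S^c} scaled so that
    (1 - eta)||u_{S^c}||_1 <= (1 + delta*eta)||u_S||_1, and returns the smallest
    value of u' Sigma u / ||u||_2^2 found (an upper bound on the infimum).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Sigma = X.T @ X / n
    Sc = np.setdiff1d(np.arange(p), S)
    best = np.inf
    for _ in range(n_draws):
        u = np.zeros(p)
        u[S] = rng.standard_normal(len(S))
        v = rng.standard_normal(len(Sc))
        # largest l1 mass allowed off the support by the cone constraint
        budget = (1 + delta * eta) * np.sum(np.abs(u[S])) / (1 - eta)
        u[Sc] = v * budget * rng.uniform() / np.sum(np.abs(v))
        best = min(best, u @ Sigma @ u / (u @ u))
    return best
```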
Proposition 2.1. Let RE and RCIF be as in (2.7)-(2.10), $\eta\in(0,1)$, and $\xi = (1+\delta^*\eta)/(1-\eta)$. If $\|\omega_S\|_\infty \le 1+\delta^*\eta$, then
$$\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega) \ge \frac{\mathrm{RE}_1^2(S;\eta,\delta^*)}{(1+\xi)^2}, \quad \mathrm{RCIF}_{\mathrm{est},1}(S;\eta,\omega) \ge \frac{\mathrm{RE}_1^2(S;\eta,\delta^*)}{(1+\xi)^2}, \quad \mathrm{RCIF}_{\mathrm{est},2}(S;\eta,\omega) \ge \frac{\mathrm{RE}_1(S;\eta,\delta^*)\,\mathrm{RE}_2(S;\eta,\delta^*)}{1+\xi}. \qquad (2.11)$$
Proof of Proposition 2.1. Since $\|\omega_S\|_\infty \le 1+\delta^*\eta$, we have
$$(1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S \le (1+\delta^*\eta)\|u_S\|_1.$$
It then follows that
$$\frac{\|\Sigma u\|_\infty^2\,|S|}{u^T\Sigma u} \ge \frac{u^T\Sigma u\,|S|}{\|u\|_1^2} \ge \frac{u^T\Sigma u\,|S|}{(1+\xi)^2\|u_S\|_1^2}.$$
The first inequality of (2.11) is obtained by taking the infimum in the cone $\mathscr{C}(S;\eta,\delta^*) = \{u : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta^*\eta)\|u_S\|_1\}$. Similarly,
$$\frac{\|\Sigma u\|_\infty\,|S|}{\|u\|_1} \ge \frac{u^T\Sigma u\,|S|}{\|u\|_1^2} \ge \frac{u^T\Sigma u\,|S|}{(1+\xi)^2\|u_S\|_1^2},$$
$$\frac{\|\Sigma u\|_\infty\,|S|^{1/2}}{\|u\|_2} \ge \frac{u^T\Sigma u\,|S|^{1/2}}{\|u\|_1\|u\|_2} \ge \frac{u^T\Sigma u\,|S|^{1/2}}{(1+\xi)\|u_S\|_1\|u\|_2}.$$
The second and third inequalities of (2.11) can be obtained by taking the infimum in the cone $\mathscr{C}(S;\eta,\delta^*)$ on the above inequalities. $\Box$
2.2.3 Properties of concave PLSE
As our analysis directly allows the penalty to depend on the index $j$, we consider the following generalization of the penalized loss (2.2),
$$L(\beta;\lambda) = \frac{1}{2n}\|y - X\beta\|_2^2 + \sum_{j=1}^{p}\rho_j(\beta_j;\lambda).$$
Given penalty functions $\rho_j(\cdot;\cdot)$ and a penalty level $\lambda$, a vector $\hat\beta\in\mathbb{R}^p$ is a critical point of the penalized loss (2.2) if the following local Karush-Kuhn-Tucker (KKT) condition is satisfied:
$$x_j^T(y - X\hat\beta)/n = \dot\rho_j(\hat\beta_j;\lambda) \qquad (2.12)$$
for a certain version of $\dot\rho_j(\hat\beta_j;\lambda)$ (between the left and right derivatives, as in our convention) for every $j = 1,\ldots,p$. By property (v) of the penalty, (2.12) is well defined and $|\dot\rho_j(\hat\beta_j;\lambda)| \le \lambda$. When the penalized loss is convex, the local KKT condition (2.12) is necessary and sufficient for the global minimization of the penalized loss $L(\cdot;\lambda)$. In general, the solutions of (2.12) include all local minimizers of $L(\cdot;\lambda)$.
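As an illustration (not from the thesis), the local KKT condition (2.12) can be checked numerically for a candidate solution; the hypothetical helper below reports the largest violation.

```python
import numpy as np

def kkt_violation(beta_hat, X, y, lam, pen_dot, tol=1e-8):
    """Largest violation of the local KKT condition (2.12).

    For active coordinates we require x_j'(y - X beta)/n = rho_dot(beta_j; lam);
    for beta_j = 0 the (sub)derivative may be anything in [-lam, lam].
    `pen_dot(t, lam)` is the penalty derivative for t > 0 (e.g. mcp_dot above).
    """
    n = X.shape[0]
    grad = X.T @ (y - X @ beta_hat) / n
    viol = 0.0
    for j, b in enumerate(beta_hat):
        if abs(b) > tol:                      # active coordinate
            target = np.sign(b) * pen_dot(abs(b), lam)
            viol = max(viol, abs(grad[j] - target))
        else:                                 # inactive: |gradient| must not exceed lam
            viol = max(viol, max(abs(grad[j]) - lam, 0.0))
    return viol
```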
For positive $\lambda_*$ and $\kappa_*$, consider the class of all penalty functions $\rho_j(\cdot;\lambda)$ with penalty level no smaller than $\lambda_*$ and concavity no greater than $\kappa_*$,
$$\mathscr{P}(\lambda_*,\kappa_*) = \big\{\rho_j(\cdot;\lambda) : \lambda \ge \lambda_*,\ \kappa(\rho_j,\lambda) \le \kappa_*\big\}. \qquad (2.13)$$
Among all local solutions for all such penalties $\rho_j(\cdot;\lambda)$ in $\mathscr{P}(\lambda_*,\kappa_*)$, we shall focus on the subclass $\mathscr{B}_0(\lambda_*,\kappa_*)$ of those connected to the origin through a continuous path of such solutions. Formally, let
$$\mathscr{B} = \mathscr{B}(\lambda_*,\kappa_*) = \big\{\hat\beta : \text{(2.12) holds with some } \rho_j(\cdot;\lambda)\in\mathscr{P}(\lambda_*,\kappa_*)\big\}.$$
The class $\mathscr{B}_0(\lambda_*,\kappa_*)$ can be written as
$$\mathscr{B}_0(\lambda_*,\kappa_*) = \big\{\hat\beta : \hat\beta \text{ and } 0 \text{ are connected in } \mathscr{B}(\lambda_*,\kappa_*)\big\}. \qquad (2.14)$$
As $\hat\beta = 0$ is the sparsest solution, $\mathscr{B}_0$ can be viewed as the sparse branch of the solution space $\mathscr{B}$.
By definition, $\mathscr{B}_0(\lambda_*,\kappa_*)$ is the set of all local solutions computable by path-following algorithms starting from the origin, with the constraints $\lambda \ge \lambda_*$ and $\kappa(\rho_j,\lambda) \le \kappa_*$ on the penalty and concavity levels respectively. This is a large class of estimators, as it includes all local solutions connected to the origin regardless of the specific algorithm used to compute the solution, and different types of penalties can be used in a single solution path. For example, the Lasso estimator belongs to the class as it is connected to the origin through the LARS algorithm [58, 59, 20]. The SCAD and MCP solutions belong to the class if they are computed by the PLUS algorithm [84] or by a path-following algorithm starting from the Lasso solution.
The following theorem studies the difference between solutions $\hat\beta\in\mathscr{B}_0(\lambda_*,\kappa_*)$ and an oracle coefficient vector $\beta^o$ satisfying $\mathrm{supp}(\beta^o)\subseteq S$ under the RE condition on the design matrix. The vector $\beta^o\in\mathbb{R}^p$ can be taken as the true regression coefficient vector $\beta^*$, so that Theorem 2.1 directly yields prediction and estimation error bounds under the RE condition. Alternatively, $\beta^o$ can be taken as the oracle LSE $\hat\beta^o$ given by
$$\hat\beta^o_S = (X_S^TX_S)^{-1}X_S^Ty, \qquad \hat\beta^o_{S^c} = 0, \qquad (2.15)$$
with $S = \mathrm{supp}(\beta^*)$, so that Theorem 2.1 directly yields sufficient conditions for selection consistency and, indirectly, sharper prediction and estimation error bounds, still under the RE condition.
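The oracle LSE (2.15) is immediate to compute when $S$ is known; a minimal sketch (assuming $X_S$ has full column rank):

```python
import numpy as np

def oracle_lse(X, y, S):
    """Oracle least-squares estimator (2.15): restricted LSE on the support S, zero elsewhere."""
    p = X.shape[1]
    beta_o = np.zeros(p)
    beta_o[S], *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    return beta_o
```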
We consider here penalty levels no smaller than a certain $\lambda_*$ satisfying
$$\|X_{S^c}^T(y - X\beta^o)/n\|_\infty < \eta\lambda_*, \qquad \|X_S^T(y - X\beta^o)/n\|_\infty \le \eta\delta^*\lambda_*, \qquad (2.16)$$
where $\eta < 1$ and $\delta^* \le 1$. When $\varepsilon = y - X\beta^* \sim N(0,\sigma^2 I_{n\times n})$ and
$$\lambda_* = (\sigma/\eta)\sqrt{(2/n)\log p},$$
(2.16) holds with probability at least $1 - \sqrt{2/(\pi\log p)}$, as the $x_j$ are normalized to $\|x_j\|_2 = \sqrt{n}$, provided that $\beta^o$ is either the true $\beta^*$ with $\delta^* = 1$ or the oracle LSE in (2.15) with $\delta^* = 0$. Smaller penalty levels will be considered in Section 2.3.
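A minimal sketch (not part of the thesis) of the universal penalty level above and of a direct check of condition (2.16) for a given $\beta^o$:

```python
import numpy as np

def universal_penalty_level(n, p, sigma, eta):
    """lambda_* = (sigma/eta) * sqrt((2/n) * log p), the universal level used with (2.16)."""
    return (sigma / eta) * np.sqrt(2.0 * np.log(p) / n)

def condition_216_holds(X, y, beta_o, S, eta, delta, lam_star):
    """Check the two sup-norm bounds of (2.16) for a given oracle vector beta_o."""
    n, p = X.shape
    z = X.T @ (y - X @ beta_o) / n
    Sc = np.setdiff1d(np.arange(p), S)
    return (np.max(np.abs(z[Sc])) < eta * lam_star and
            np.max(np.abs(z[S])) <= eta * delta * lam_star)
```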
We study the difference between solutions $\hat\beta\in\mathscr{B}_0(\lambda_*,\kappa_*)$ and the oracle coefficient vector $\beta^o$ via a random vector $\omega = \omega(\beta^o,\lambda)$ with elements
$$w_j = \dot\rho_j(\beta^o_j;\lambda)/\lambda - x_j^T(y - X\beta^o)/(\lambda n). \qquad (2.17)$$
The relevance of $\omega$ can be clearly seen from the definition of $\hat\beta$ in (2.12), as
$$x_j^TX(\beta^o - \hat\beta)/n = \lambda w_j + \dot\rho_j(\hat\beta_j;\lambda) - \dot\rho_j(\beta^o_j;\lambda). \qquad (2.18)$$
We may choose $\omega$ to satisfy $\mathrm{supp}(\omega)\subseteq S$ in our convention, as $\dot\rho_j(\beta^o_j;\lambda)$ is allowed to take any value in $[-\lambda,\lambda]$ for $\beta^o_j = 0$. However, this choice is not used in our analysis. Let $\phi_{\min}(M)$ denote the minimum eigenvalue of a symmetric matrix $M$.
Theorem 2.1. Let $\hat\beta$ be a solution of (2.12) in $\mathscr{B}_0(\lambda_*,\kappa_*)$ with penalties $\rho_j(\cdot;\lambda)$ in the class $\mathscr{P}(\lambda_*,\kappa_*)$. Suppose $\mathrm{RE}_2^2(S;\eta,\delta^*) \ge \kappa_*$ and (2.16) holds for a certain $\beta^o\in\mathbb{R}^p$ and $S\supseteq\mathrm{supp}(\beta^o)$. Let $\omega$ be as in (2.17).
(i) With $\xi = (1+\delta^*\eta)/(1-\eta)$,
$$\|X\hat\beta - X\beta^o\|_2^2/n \le \frac{(1+\eta)^2\lambda^2|S|}{\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega)} \le \frac{(1+\xi)^2(1+\eta)^2\lambda^2|S|}{\mathrm{RE}_1^2(S;\eta,\delta^*)}, \qquad (2.19)$$
$$\|\hat\beta - \beta^o\|_q \le
\begin{cases}
\dfrac{(1+\eta)\lambda|S|}{\mathrm{RCIF}_{\mathrm{est},1}(S;\eta,\omega)} \le \dfrac{(1+\xi)^2(1+\eta)\lambda|S|}{\mathrm{RE}_1^2(S;\eta,\delta^*)}, & q = 1,\\[2ex]
\dfrac{(1+\eta)\lambda|S|^{1/2}}{\mathrm{RCIF}_{\mathrm{est},2}(S;\eta,\omega)} \le \dfrac{(1+\xi)(1+\eta)\lambda|S|^{1/2}}{\mathrm{RE}_1(S;\eta,\delta^*)\,\mathrm{RE}_2(S;\eta,\delta^*)}, & q = 2,\\[2ex]
\dfrac{(1+\eta)\lambda|S|^{1/q}}{\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega)}, & q \ge 1.
\end{cases} \qquad (2.20)$$
(ii) Suppose $\max_{j\le p}\kappa(\beta^o_j;\rho_j,\lambda) \le (1 - 1/C_0)\mathrm{RE}_2^2(S;\eta,\delta^*)$. Then,
$$\|X\hat\beta - X\beta^o\|_2^2/n \le (C_0\lambda)^2\sup_{u\ne 0}\frac{\big[\omega_S^Tu_S - (1-\eta)\|u_{S^c}\|_1\big]_+^2}{u^T\Sigma u} \qquad (2.21)$$
and, for any seminorm $\|\cdot\|$ as a loss function,
$$\|\hat\beta - \beta^o\| \le C_0\lambda\sup_{u\ne 0}\frac{\|u\|\big[\omega_S^Tu_S - (1-\eta)\|u_{S^c}\|_1\big]_+}{u^T\Sigma u}. \qquad (2.22)$$
(iii) Suppose $\beta^o$ is a solution of (2.12), or equivalently $\omega_S = 0$. Then,
$$\hat\beta_{S^c} = 0 \quad\text{and}\quad \mathrm{sgn}(\hat\beta_j)\,\mathrm{sgn}(\beta^o_j) \ge 0 \ \ \forall\, j\in S. \qquad (2.23)$$
If $\kappa(0;\rho_j,\lambda) < \phi_{\min}(X_S^TX_S/n)$, then
$$\mathrm{sgn}(\hat\beta) = \mathrm{sgn}(\beta^o). \qquad (2.24)$$
If $\max_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda) < \phi_{\min}(X_S^TX_S/n)$, then
$$\hat\beta = \beta^o. \qquad (2.25)$$
Remark 2.1. In the above theorem, one may also use a relaxed version of $\mathrm{RE}_1$ and $\mathrm{RE}_2$ with the constraint replaced by $(1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S$.
Corollary 2.1. Suppose $\dot\rho_j(t;\lambda) = 0$ for $|t| > \lambda\gamma$ and the conditions of Theorem 2.1 (ii) hold with $\beta^o = \hat\beta^o$ being the oracle estimator in (2.15). Then, (2.21) and (2.22) hold with $\|\omega\|_2^2 \le (1+\delta^*\eta)^2|S_1|$, where $S_1 = \{j\in S : |\hat\beta^o_j| \le \lambda\gamma\}$. Consequently, when $C_0^2/\mathrm{RE}_2^2(S;\eta,\delta^*) = O_P(1)$ and $\lambda \lesssim \sigma\sqrt{(\log p)/n}$,
$$\|X\hat\beta - X\hat\beta^o\|_2^2/n + \|\hat\beta - \hat\beta^o\|_2^2 = O_P(\sigma^2/n)\,|S_1|\log p,$$
implying $\hat\beta = \hat\beta^o$ when $|S_1| = 0$, and
$$\|X\hat\beta - X\beta^*\|_2^2/n + \|\hat\beta - \beta^*\|_2^2 = O_P(\sigma^2/n)\big(|S_1|\log p + |S|\big).$$
Theorem 2.1 gives a unified treatment of penalized least-squares methods, including the $\ell_1$ and concave penalties, under the RE condition on the design matrix and natural conditions on the penalty. For prediction and coefficient estimation, (2.19) and (2.20) match the state of the art for the Lasso in both the convergence rate and the regularity condition on the design, while (2.21), (2.22) and Corollary 2.1 demonstrate the advantages of concave penalization when $|S_1|$ is of smaller order than $|S|$. Moreover, the prediction and estimation error bounds in Theorem 2.1 (ii) and Corollary 2.1 directly and naturally provide selection consistency when $\omega_S = 0$ or $|S_1| = 0$. More precisely, for selection consistency, Theorem 2.1 (iii) requires only the RE condition for (2.23) and mild additional eigenvalue conditions for (2.24) and (2.25), provided the existence of an oracular solution $\beta^o$ with $\mathrm{supp}(\beta^o)\subseteq S$, or equivalently $\omega_S = 0$. Note that $\kappa(0;\rho_j,\lambda) \le \kappa_*$ and $\mathrm{RE}_2^2(S;0) \le \phi_{\min}(X_S^TX_S/n)$ by definition. For concave penalties, the condition $\omega_S = 0$ can be fulfilled by the rate-optimal signal strength condition $\min_{j\in S}|\hat\beta^o_j| > \gamma\lambda$ as in Corollary 2.1. However, the condition $\omega_S = 0$ for the Lasso requires more restrictive $\ell_\infty$-type conditions such as the irrepresentable condition on the design. These RE-based results are new and significant, as the existing theory for concave penalization, which requires substantially stronger conditions on the design such as the sparse Riesz condition in [84], leaves a false impression that the Lasso has a technical advantage in prediction and parameter estimation under the RE condition on the design. Moreover, compared with existing analyses, the proof of Theorem 2.1 is much simpler.
For the Lasso, $\kappa_* = 0 \le \mathrm{RE}_2(S;\eta,\delta^*)$ always holds and $C_0 = 1$ in Theorem 2.1 (ii), which implies the following corollary due to $\|\omega_S\|_\infty \le 1+\eta$.
Corollary 2.2. Let $\hat\beta$ be the Lasso estimator. If (2.16) holds for a coefficient vector $\beta^o\in\mathbb{R}^p$ with $S\supseteq\mathrm{supp}(\beta^o)$, then
$$\frac{\|X\hat\beta - X\beta^o\|_2^2/n}{(1+\eta)^2\lambda^2} \le \sup_{u\ne 0}\frac{\psi^2(u)}{u^T\Sigma u} \le \max\left\{\frac{|S|(1-1/\xi)^2}{\mathrm{RE}_1^2(S;\eta,\delta^*)},\ \frac{|S|}{\mathrm{RE}_1^2(S;0)}\right\}$$
with $\psi(u) = \big[\|u_S\|_1 - \|u_{S^c}\|_1/\xi\big]_+$ and $\xi = (1+\eta)/(1-\eta)$, and
$$\frac{\|\hat\beta - \beta^o\|_2}{(1+\eta)\lambda} \le \sup_{u\ne 0}\frac{\|u\|_2\,\psi(u)}{u^T\Sigma u} \le \max\left\{\frac{|S|^{1/2}(1-1/\xi)}{\mathrm{RE}_{1,2}^2(S;\eta,\delta^*)},\ \frac{|S|^{1/2}}{\mathrm{RE}_{1,2}^2(S;0)}\right\},$$
where $\mathrm{RE}_{1,2}(S;\eta,\delta^*) = \{\mathrm{RE}_2(S;\eta,\delta^*)\,\mathrm{RE}_1(S;\eta,\delta^*)\}^{1/2} \ge \mathrm{RE}_2(S;\eta,\delta^*)$.
In fact, the sharper prediction and coefficient estimation error bounds in Corollary 2.2 are the sharpest possible based on the basic inequality $u^T\Sigma u \le \psi(u)$ for $u = (\hat\beta - \beta^o)/\lambda$. For example, the first bound is strictly sharper than the familiar prediction error bound in [77],
$$\|X\hat\beta - X\beta^o\|_2^2/n \le (1+\eta)^2\lambda^2|S|/\mathrm{RE}_1^2(S;\eta,\delta^*),$$
when $\mathrm{RE}_1^2(S;\eta,\delta^*) < \mathrm{RE}_1^2(S;0)$.
To prove Theorem 2.1, we first present the following lemma.
Lemma 2.1. Let $S\subset\{1,\ldots,p\}$, $\lambda > 0$, $\hat\beta$ be a solution of (2.12), and $\beta^o$ a coefficient vector satisfying $\mathrm{supp}(\beta^o)\subseteq S$. Let $h = \hat\beta - \beta^o$, $\omega$ be as in (2.17), and $z = X^T(y - X\beta^o)/n$. Then,
$$h^T\Sigma h = \sum_j h_j\big\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\big\} \le \sum_{j\in S^c}(z_jh_j - \lambda|h_j|) - \lambda\,\omega_S^Th_S + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2 \qquad (2.26)$$
and $|w_j| \le 1 + |z_j|/\lambda$ for $j\in S$, where $\kappa(\beta^o_j;\rho_j,\lambda)$ is as in (2.3).
Proof of Lemma 2.1. Recall that $h = \hat\beta - \beta^o$ and $z = X^T(y - X\beta^o)/n$ with a $\beta^o$ satisfying $\mathrm{supp}(\beta^o)\subseteq S$. For $j\in S^c$, $h_j = \hat\beta_j$, so that by (2.17)
$$\begin{aligned}
h_j\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\}
&= \hat\beta_j\{z_j - \dot\rho_j(\hat\beta_j;\lambda)\}\\
&\le \hat\beta_j z_j - |\hat\beta_j|\,\dot\rho_j(|\hat\beta_j|;\lambda) \qquad (2.27)\\
&\le \hat\beta_j z_j - \lambda|\hat\beta_j| + (|\hat\beta_j| - 0)\big\{\dot\rho_j(0+;\lambda) - \dot\rho_j(|\hat\beta_j|;\lambda)\big\}\\
&\le z_jh_j - \lambda|h_j| + \kappa(0;\rho_j,\lambda)h_j^2.
\end{aligned}$$
For $j\in S$,
$$h_j\{\dot\rho_j(\beta^o_j;\lambda) - \dot\rho_j(\hat\beta_j;\lambda) - \lambda w_j\} \le -\lambda w_jh_j + \kappa(\beta^o_j;\rho_j,\lambda)h_j^2. \qquad (2.28)$$
Summing the above inequalities over $j$, we find via (2.18) that (2.26) holds. Moreover, by the definition of $\omega$ in (2.17), $|w_j| \le 1 + |z_j|/\lambda$ for $j\in S$. $\Box$
Proposition 2.2. Let $S\supseteq\mathrm{supp}(\beta^o)$, $\{\eta,\delta^*\}$ be as in (2.16),
$$\mathscr{C}(S;\eta,\delta^*) = \big\{u : (1-\eta)\|u_{S^c}\|_1 \le (1+\delta^*\eta)\|u_S\|_1\big\},$$
and $\mathscr{B}_0^*(\lambda_*,\kappa_*) = \mathscr{B}(\lambda_*,\kappa_*)\cap\{\beta^o + \mathscr{C}(S;\eta,\delta^*)\}$ be the set of all solutions $\hat\beta$ of (2.12) with penalties in $\mathscr{P}(\lambda_*,\kappa_*)$ and estimation error $\hat\beta - \beta^o$ in the cone $\mathscr{C}(S;\eta,\delta^*)$. Let $\hat\beta\in\mathscr{B}_0^*(\lambda_*,\kappa_*)$ with penalty level $\lambda$ and $\tilde\beta\in\mathscr{B}(\lambda_*,\kappa_*)$ with penalty level $\tilde\lambda$. Suppose $\mathrm{RE}_2^2(S;\eta,\delta^*) \ge \kappa_*$ and (2.16) holds. Let $\varepsilon_1 = (\eta - \|z_{S^c}\|_\infty/\lambda_*)/2$, $\varepsilon_2 = \varepsilon_1/(2\kappa_*)$ and $\varepsilon_0 = \min\{\varepsilon_2,\ \varepsilon_1\varepsilon_2/(1+\eta)\}$. Then,
$$\big\|(\tilde\beta - \beta^o)/\tilde\lambda - (\hat\beta - \beta^o)/\lambda\big\|_1 \le \varepsilon_0 \ \Rightarrow\ \tilde\beta\in\mathscr{B}_0^*(\lambda_*,\kappa_*).$$
Proposition 2.2 asserts that, among general solutions $\hat\beta$ of (2.12) in $\mathscr{B}(\lambda_*,\kappa_*)$, those with normalized error $(\hat\beta - \beta^o)/\lambda$ inside the cone $\mathscr{C}(S;\eta,\delta^*)$ and those outside the cone are separated by $\varepsilon_0$ in the $\ell_1$ distance of the normalized error. Thus, if $\hat\beta^{(t)}$ is a sequence of such solutions with penalty levels $\lambda^{(t)}$ such that the normalized errors $u^{(t)} = (\hat\beta^{(t)} - \beta^o)/\lambda^{(t)}$ have small $\ell_1$ increments, $\|u^{(t)} - u^{(t-1)}\|_1 \le \varepsilon_0$, then the $u^{(t)}$ are either all in the cone $\mathscr{C}(S;\eta,\delta^*)$ or all outside the cone. In particular, Proposition 2.2 implies that the solutions $\hat\beta\in\mathscr{B}_0(\lambda_*,\kappa_*)$ have the cone property $\hat\beta - \beta^o\in\mathscr{C}(S;\eta,\delta^*)$, or equivalently $\mathscr{B}_0(\lambda_*,\kappa_*)\subseteq\beta^o + \mathscr{C}(S;\eta,\delta^*)$, as $\lambda\oplus\hat\beta$ is connected to $\lambda^{(0)}\oplus 0$ through a continuous path and the origin $0$ has the cone property.
Proof of Proposition 2.2. Let $u = (\hat\beta - \beta^o)/\lambda$ and $v = (\tilde\beta - \beta^o)/\tilde\lambda$. We want to prove that
$$\|u - v\|_1 \le \varepsilon_0 \ \text{ and } \ u\in\mathscr{C}(S;\eta,\delta^*) \ \text{ imply } \ v\in\mathscr{C}(S;\eta,\delta^*). \qquad (2.29)$$
By the definition of $\varepsilon_1$ and condition (2.16), we have $\varepsilon_1 > 0$. As $\kappa(\beta^o_j;\rho_j,\lambda) \le \kappa_*$ and $\|z_{S^c}\|_\infty/\lambda \le \|z_{S^c}\|_\infty/\lambda_* \le \eta - 2\varepsilon_1$, Lemma 2.1 implies that
$$u^T\Sigma u + (1 - \eta + 2\varepsilon_1)\|u_{S^c}\|_1 \le -\omega_S^Tu_S + \big\{\max_j\kappa(\beta^o_j;\rho_j,\lambda)\big\}\|u\|_2^2 \le (1+\delta^*\eta)\|u_S\|_1 + \kappa_*\|u\|_2^2 \qquad (2.30)$$
and that the same inequalities also hold for $v$. Recall that $\varepsilon_2 = \varepsilon_1/(2\kappa_*)$. If $\|u\|_1 \le \varepsilon_2$ and $\|v - u\|_1 \le \varepsilon_2$, then $\|v\|_1 \le \varepsilon_1/\kappa_*$, so that the $v$-version of (2.30) implies
$$(1 - \eta + 2\varepsilon_1)\|v_{S^c}\|_1 \le (1+\delta^*\eta)\|v_S\|_1 + \kappa_*\|v\|_1^2 \le (1+\delta^*\eta)\|v_S\|_1 + \varepsilon_1\|v\|_1,$$
or equivalently
$$(1 - \eta + \varepsilon_1)\|v_{S^c}\|_1 \le (1+\delta^*\eta+\varepsilon_1)\|v_S\|_1,$$
which then implies $v\in\mathscr{C}(S;\eta,\delta^*)$. Because $u\in\mathscr{C}(S;\eta,\delta^*)$, we have $\kappa_*\|u\|_2^2 \le u^T\Sigma u$ by the RE condition, so that (2.30) implies
$$(1 + 2\varepsilon_1 - \eta)\|u_{S^c}\|_1 \le (1+\delta^*\eta)\|u_S\|_1.$$
Due to $(1+\varepsilon_1-\eta)(1+\delta^*\eta) \le (1+2\varepsilon_1-\eta)(1-\varepsilon_1+\delta^*\eta)$, we have
$$(1+\varepsilon_1-\eta)\|u_{S^c}\|_1 \le (1-\varepsilon_1+\delta^*\eta)\|u_S\|_1.$$
If $\|u\|_1 > \varepsilon_2$ and $\|v - u\|_1 \le \varepsilon_1\varepsilon_2/(1+\delta^*\eta)$, then $v\in\mathscr{C}(S;\eta,\delta^*)$ follows from
$$\begin{aligned}
(1-\eta)\|v_{S^c}\|_1 - (1+\delta^*\eta)\|v_S\|_1
&\le (1-\eta)\|u_{S^c}\|_1 - (1+\delta^*\eta)\|u_S\|_1 + (1+\delta^*\eta)\|v - u\|_1\\
&\le (1-\eta)\|u_{S^c}\|_1 - (1+\delta^*\eta)\|u_S\|_1 + \varepsilon_1\|u\|_1\\
&\le 0.
\end{aligned}$$
Thus, (2.29) holds with $\varepsilon_0 = \min\{\varepsilon_2,\ \varepsilon_1\varepsilon_2/(1+\eta)\}$. $\Box$
Proof of Theorem 2.1. Let $h = \hat\beta - \beta^o$ and $u = h/\lambda$. It follows from Proposition 2.2 that $u\in\mathscr{C}(S;\eta,\delta^*)$, as $\lambda\oplus\hat\beta$ is connected to $\lambda^{(0)}\oplus 0$ through a continuous path and the origin has the cone property. Let $1\le C_0\le\infty$ satisfy the condition $\max_j\kappa(\beta^o_j;\rho_j,\lambda) \le (1 - 1/C_0)\mathrm{RE}_2^2(S;\eta,\delta^*)$. As $u\in\mathscr{C}(S;\eta,\delta^*)$, $u^T\Sigma u \ge \mathrm{RE}_2^2(S;\eta,\delta^*)\|u\|_2^2$, so that by (2.30)
$$C_0^{-1}u^T\Sigma u + (1-\eta)\|u_{S^c}\|_1 \le -\omega_S^Tu_S \le \|\omega_S\|_2\|u\|_2. \qquad (2.31)$$
This immediately implies (2.21) and (2.22) with $u = (\hat\beta - \beta^o)/\lambda$. For (2.19) and (2.20), we set $C_0 = \infty$. By the definition of the RCIF,
$$\mathrm{RCIF}_{\mathrm{pred}}(S;\eta,\omega)\,u^T\Sigma u \le \|\Sigma u\|_\infty^2\,|S|.$$
Consequently, the first inequality in (2.19) follows from the fact that
$$\|\Sigma u\|_\infty \le \|X^T(y - X\hat\beta)/n\|_\infty/\lambda + \|X^T(y - X\beta^o)/n\|_\infty/\lambda \le 1+\eta,$$
and the second inequality follows from the first inequality in (2.11). Similarly, the first inequality in (2.20) follows from
$$\mathrm{RCIF}_{\mathrm{est},q}(S;\eta,\omega)\,\|u\|_q \le \|\Sigma u\|_\infty|S|^{1/q} \le (1+\eta)|S|^{1/q},$$
and the second follows from the second and third inequalities in (2.11).
Finally, we consider selection consistency under the assumption $\omega_S = 0$. In this case, $\beta^o$ is a solution of (2.12), and $\|h_{S^c}\|_1 = 0$ by (2.31). Moreover, because both $\beta^o$ and $\hat\beta$ are solutions of (2.12) with support in $S$,
$$\kappa_*\|h_S\|_2^2 \le h_S^T(X_S^TX_S/n)h_S = -\sum_{j\in S}h_j\big\{\dot\rho_j(\hat\beta_j;\lambda) - \dot\rho_j(\beta^o_j;\lambda)\big\} \le \sum_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda)h_j^2.$$
As $\kappa(\beta^o_j;\rho_j,\lambda) \le \kappa_*$, the maximum concavity is attained above at every $j\in S$, in the sense that $-\dot\rho_j(\hat\beta_j;\lambda) + \dot\rho_j(\beta^o_j;\lambda) = \kappa_*(\hat\beta_j - \beta^o_j)$ for all $j\in S$ with $h_j\ne 0$. This is possible only when $\mathrm{sgn}(\hat\beta_j)\,\mathrm{sgn}(\beta^o_j) \ge 0$ for all $j\in S$. Furthermore, $\mathrm{sgn}(\hat\beta_j)\,\mathrm{sgn}(\beta^o_j) > 0$ for all $j\in S$ when $\kappa(0;\rho_j,\lambda) < \phi_{\min}(X_S^TX_S/n)$, and $h_S = 0$ when $\max_{j\in S}\kappa(\beta^o_j;\rho_j,\lambda) < \phi_{\min}(X_S^TX_S/n)$. $\Box$
2.3 Smaller penalty levels
We have studied in Section 2.2 exact solutions of (2.12) for penalty levels $\lambda \ge \lambda_*$ in the event where $\lambda_*$ is a strict upper bound of the supremum norm of the random vector $z = X^T(y - X\beta^o)/n$, as in (2.16). Such penalty or threshold levels are commonly used in the literature to study regularized methods in high-dimensional regression. However, they are quite conservative and often yield poor numerical results. In this section, we consider smaller penalty levels under somewhat stronger RE conditions on the design.
2.3.1 Smaller penalty levels
We consider penalty levels $\lambda$ which control a sparse $\ell_2$ norm of a truncated $z = X^T(y - X\beta^o)/n$, instead of the larger $\ell_\infty$ norm of $z$. For $q\in[1,\infty]$ and $t > 0$, the sparse $\ell_q$ norm is defined as
$$\|v\|_{(q,t)} = \max_{J\subset\{1,\ldots,p\},\,|J| < t+1}\|v_J\|_q.$$
To control the effect of the noise, we consider penalty levels $\lambda \ge \lambda_*$ with a minimum penalty level $\lambda_*$ such that
$$\big\|(|z| - \eta_0\lambda_*)_+\big\|_{(2,m)} = \sup_{|J| = m}\sqrt{\sum_{j\in J}\big(|z_j| - \eta_0\lambda_*\big)_+^2} < \eta_1m^{1/2}\lambda_* \qquad (2.32)$$
happens with high probability for certain positive numbers $\eta_0$ and $\eta_1$ satisfying $\eta_0 + \eta_1 < 1$ and a positive integer $m$. It is clear that (2.16) implies (2.32) with $\eta = \eta_0 + \eta_1$ and $m = 1$. As properties of the Lasso have been considered in [70] under penalty levels $\lambda > \lambda_*$ with the smaller $\lambda_*$ in (2.32), the results in this subsection can be viewed as an extension of those results to general solutions of (2.12) in the set $\mathscr{B}_0(\lambda_*,\kappa_*)$ in (2.14).
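The truncated sparse $\ell_2$ norm in (2.32) involves only the $m$ largest entries of $(|z| - \eta_0\lambda_*)_+$; a short illustrative sketch (not part of the thesis):

```python
import numpy as np

def truncated_sparse_l2(z, eta0, lam_star, m):
    """||(|z| - eta0*lambda_*)_+||_(2,m): l2 norm of the m largest truncated entries, as in (2.32)."""
    t = np.maximum(np.abs(z) - eta0 * lam_star, 0.0)
    top_m = np.sort(t)[-m:]
    return np.sqrt(np.sum(top_m ** 2))

def condition_232_holds(z, eta0, eta1, lam_star, m):
    """Check whether the smaller-penalty-level condition (2.32) holds."""
    return truncated_sparse_l2(z, eta0, lam_star, m) < eta1 * np.sqrt(m) * lam_star
```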
With $\eta = \eta_0 + \eta_1$ and $z = X^T(y - X\beta^o)/n$, define
$$\bar S = \{j\in\{1,\ldots,p\} : |z_j| \ge \eta\lambda_*\} \qquad (2.33)$$
to be the set of indices of large $|z_j|$'s. The main consequences of (2.32) are
$$|\bar S| < m, \qquad \sum_{j\in\bar S}(|z_j| - \eta_0\lambda_*)_+^2 < m(\eta_1\lambda_*)^2, \qquad \|z_{\bar S^c}\|_\infty < \eta\lambda_*. \qquad (2.34)$$
These properties can be used to prove
$$\|X\hat\beta - X\beta^o\|_2^2/n \lesssim \big(\|\omega_S\|_2^2 + m\big)\lambda^2, \qquad (2.35)$$
with $\|\omega_S\|_2^2 \lesssim |S|$ in the worst case scenario, and parallel estimation error bounds under a certain RE-type condition. See [70] and Subsection 2.3.3.
Consider Gaussian noise $\varepsilon = y - X\beta^* \sim N(0,\sigma^2I_{n\times n})$. Let $L_1(t) = \Phi^{-1}(1-t)$ be the standard normal negative quantile function. Sun and Zhang [70] proved that, when $\beta^o$ is the true coefficient vector, $\beta^o = \beta^*$, (2.32) holds with probability at least $1-\epsilon$ under the conditions
$$\eta_0\lambda_* = (\sigma/n^{1/2})L_1(k/p), \qquad \frac{\eta_1}{\eta_0} > \left(\frac{4k/m}{L_1^4(k/p) + 2L_1^2(k/p)}\right)^{1/2} + \frac{L_1(\epsilon/p)}{L_1(k/p)}\left(\frac{\kappa_+(m)}{m}\right)^{1/2}, \qquad (2.36)$$
where $\kappa_+(m) = \max\{u^T\Sigma u : \|u\|_0 = m,\ \|u\|_2 = 1\}$ is the upper sparse eigenvalue of $\Sigma$. A conservative choice of $k$ is to take
$$k = L_1^4(k/p) + 2L_1^2(k/p) \qquad (2.37)$$
as in [70], giving $m = O(1)$ in the prediction and estimation error bounds. However, by (2.35), a larger $k$ can be taken without changing the order of the error bounds as long as $m \lesssim \|\omega_S\|_2^2$.
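The conservative choice (2.37) can be solved numerically for $k$; the following sketch (illustrative, not from the thesis) uses a standard root finder with $L_1(t) = \Phi^{-1}(1-t)$ as above.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def conservative_k(p):
    """Solve k = L_1^4(k/p) + 2 L_1^2(k/p) of (2.37), where L_1(t) = Phi^{-1}(1 - t)."""
    def f(k):
        L = norm.ppf(1.0 - k / p)
        return k - (L ** 4 + 2 * L ** 2)
    # f is negative for very small k and positive at k = p/2, where L_1 = 0
    return brentq(f, 1e-6, p / 2.0)

# e.g. conservative_k(1000) gives the k used to set the smaller penalty level in (2.36)
```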
2.3.2 RE-type conditions for smaller penalty levels
When a smaller penalty level is taken, a lower level of regularization is imposed on the estimator $\hat\beta$, so that the estimation error $h = \hat\beta - \beta^o$ may fail the condition $(1-\eta)\|h_{S^c}\|_1 \le (1+\eta)\|h_S\|_1$ in the definition of the restricted eigenvalue in (2.7). However, in the event (2.32), we can still prove the membership of the error $h$ in the following larger cone,
$$\mathscr{U}(S,\eta_0,\eta_1,m) = \Big\{u : (1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\bar S}\|_2 - \|u_{\bar S}\|_1\big)\Big\},$$
with $\eta = \eta_0 + \eta_1 < 1$ and the set $\bar S$ in (2.33). This will be verified in the proof of Theorem 2.2, but it can also be vaguely seen from (2.34). Consequently, the restricted eigenvalue is defined in the larger cone as
$$\overline{\mathrm{RE}}_2(S;\eta_0,\eta_1,m) = \inf\left\{\frac{(u^T\Sigma u)^{1/2}}{\|u\|_2} : 0\ne u\in\mathscr{U}(S,\eta_0,\eta_1,m)\right\}. \qquad (2.38)$$
When $m = 1$, $\bar S = \emptyset$ and the restricted eigenvalue (2.38) coincides with the original RE as defined in (2.7). Although (2.38) is a random variable due to its dependence on $\bar S$ (even for deterministic designs), it is no smaller than
$$\overline{\mathrm{RE}}_{*,2}(S;\eta,m) = \min_{|T\setminus S| < m}\ \inf\left\{\frac{(u^T\Sigma u)^{1/2}}{\|u\|_2} : \|u_{T^c}\|_1 < \xi|T|^{1/2}\|u_T\|_2\right\} \qquad (2.39)$$
due to $|\bar S| < m$ in (2.34), where $\xi = (1+\eta)/(1-\eta)$.
Similarly, we extend the relaxed cone invertibility factors (RCIF) as
$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m) = \inf\left\{\frac{\|\Sigma u\|_c^2\,s^*}{u^T\Sigma u} : u\in\mathscr{U}(S;\eta_0,\eta_1,m)\right\},$$
$$\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m) = \inf\left\{\frac{\|\Sigma u\|_c\,(s^*)^{1/q}}{\|u\|_q} : u\in\mathscr{U}(S,\eta_0,\eta_1,m)\right\}, \qquad (2.40)$$
where $s^* = \max\{|S|,|\bar S|\}$ represents a potentially lower level of sparsity due to possible selection of variables outside $S$, and $\|\cdot\|_c$ is a combination of the $\ell_2$ norm on $\bar S$ and the $\ell_\infty$ norm on $\bar S^c$, defined as
$$\|v\|_c = \max\big\{\|v_{\bar S}\|_2/m^{1/2},\ \|v_{\bar S^c}\|_\infty\big\}.$$
When $m = 1$, the combination norm coincides with the $\ell_\infty$ norm and the modified RCIFs coincide with those in (2.9) and (2.10) respectively.
2.3.3 Prediction and estimation error bounds at smaller penalty levels
Theorem 2.2. Let $\hat\beta$ be a solution of (2.12) in $\mathscr{B}_0(\lambda_*,\kappa_*)$ with penalties $\rho_j(\cdot;\lambda)\in\mathscr{P}(\lambda_*,\kappa_*)$. Let $\eta = \eta_0 + \eta_1 < 1$ with positive $\eta_0$ and $\eta_1$, let $m$ be a positive integer, $\bar S$ as in (2.33) with a certain $\beta^o\in\mathbb{R}^p$, $S\supseteq\mathrm{supp}(\beta^o)$, and $s^* = \max\{|S|,|\bar S|\}$. Suppose $\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m) \ge \kappa_*$ and (2.32) holds. Then,
$$\|X\hat\beta - X\beta^*\|_2^2/n \le \frac{\{(1+\eta)\lambda\}^2s^*}{\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m)} \le \frac{\{(1+\eta)\xi_1\lambda\}^2s^*}{\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)} \qquad (2.41)$$
with $\xi_1 = \big[2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}\big]/(1-\eta)$, and
$$\|\hat\beta - \beta^*\|_q \le
\begin{cases}
\dfrac{(1+\eta)\lambda(s^*)^{1/2}}{\overline{\mathrm{RCIF}}_{\mathrm{est},2}(S;\eta_0,\eta_1,m)} \le \dfrac{(1+\eta)\xi_1\lambda(s^*)^{1/2}}{\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)}, & q = 2,\\[2ex]
\dfrac{(1+\eta)\lambda(s^*)^{1/q}}{\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m)}, & q\in[1,2].
\end{cases} \qquad (2.42)$$
Remark 2.2. When $m\asymp s$, $k$ is of the same order, $k\asymp m$. The penalty level $\lambda_*$ in (2.36) is then of the order $\sqrt{(2/n)\log(p/k)}$. Theorem 2.2 guarantees that the prediction and $\ell_2$ estimation errors are of the order
$$\|X\hat\beta - X\beta^*\|_2^2/n + \|\hat\beta - \beta^*\|_2^2 \asymp (m/n)\log(p/m) \asymp (s/n)\log(p/s).$$
This matches the minimax prediction and $\ell_2$ estimation rate of the Slope in Bellec et al. [4].
As an extension of Theorem 2.1 (i), Theorem 2.2 provides prediction and estimation error bounds of the same form for smaller penalty levels with somewhat smaller RCIF and RE. However, the approach does not provide a full extension of Theorem 2.1 in several aspects. Due to the use of the $\ell_2$ norm in condition (2.32), the $\ell_q$ estimation error bound can be extended only for $1\le q\le 2$, and the compatibility coefficient cannot be used to bound the prediction and $\ell_1$ errors. In addition, solutions of (2.12) are not selection consistent at the smaller penalty level due to a high likelihood of some false positive selection.
We have considered so far solutions of (2.12) in the main branch of the solution space $\mathscr{B}_0(\lambda_*,\kappa_*)$ in (2.14). Such solutions are computable by path-finding algorithms. In fact, as discussed below Proposition 2.2, our analysis is also applicable if $(\hat\beta - \beta^o)/\lambda$ is connected to a cone through a discrete sequence of such normalized errors in small $\ell_1$ increments. Statistical and computational properties of iterative discrete solution paths have been studied in [80], among others. However, compared with Theorems 2.1 and 2.2, [80] requires upper sparse eigenvalue conditions on $X$ and larger penalty levels satisfying (2.16).
Proof of Theorem 2.2. Let $h = \hat\beta - \beta^o$. Recall that Lemma 2.1 gives
$$h^T\Sigma h \le \sum_{j\in S^c}(z_jh_j - \lambda|h_j|) - \lambda\omega_S^Th_S + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2 = h^Tz - \lambda\|h_{S^c}\|_1 - \sum_{j\in S}h_j\dot\rho_j(\beta^o_j;\lambda) + \sum_j\kappa(\beta^o_j;\rho_j,\lambda)h_j^2, \qquad (2.43)$$
where $z = X^T(y - X\beta^o)/n$, $w_j = \dot\rho_j(\beta^o_j;\lambda)/\lambda - z_j/\lambda$, and $\kappa(t;\rho_j,\lambda)$ is as in (2.3). Let $\varepsilon_1 = \min\big\{\eta - \|z_{\bar S^c}\|_\infty/\lambda_*,\ \eta_1 - \|(|z_{\bar S}| - \eta_0\lambda_*)_+\|_2/(m^{1/2}\lambda_*)\big\}$ and $T\supseteq\mathrm{supp}(z)$. By (2.34), $\varepsilon_1 > 0$ and
$$\begin{aligned}
|h^Tz| &\le (\eta - \varepsilon_1)\lambda_*\|h_{T\setminus\bar S}\|_1 + (\eta_0 - \varepsilon_1)\lambda_*\|h_{\bar S}\|_1 + \sum_{j\in\bar S}|h_j|\big(|z_j| - (\eta_0 - \varepsilon_1)\lambda_*\big)_+ \qquad (2.44)\\
&\le \eta\lambda_*\|h_{T\setminus\bar S}\|_1 + \eta_0\lambda_*\|h_{\bar S}\|_1 - \varepsilon_1\lambda_*\|h_T\|_1 + \Big(\|(|z_{\bar S}| - \eta_0\lambda_*)_+\|_2 + \varepsilon_1m^{1/2}\lambda_*\Big)\|h_{\bar S}\|_2\\
&\le (\eta - \varepsilon_1)\lambda_*\|h_T\|_1 + \eta_1\lambda_*\big(m^{1/2}\|h_{\bar S}\|_2 - \|h_{\bar S}\|_1\big).
\end{aligned}$$
Let $u = h/\lambda$ with $\lambda \ge \lambda_*$. Combining (2.43) and (2.44), we have
$$u^T\Sigma u + \varepsilon_1\|u\|_1 + (1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\bar S}\|_2 - \|u_{\bar S}\|_1\big) + \kappa_*\|u\|_2^2. \qquad (2.45)$$
The above inequality holds for all $u = (\hat\beta - \beta^o)/\lambda$ as long as $\hat\beta\in\mathscr{B}(\lambda_*,\kappa_*)$. As in the proof of Proposition 2.2, the condition $\overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m) \ge \kappa_*$ and the above inequality imply that for all such $u$,
$$(1-\eta)\|u_{S^c}\|_1 \le (1+\eta)\|u_S\|_1 + \eta_1\big(m^{1/2}\|u_{\bar S}\|_2 - \|u_{\bar S}\|_1\big),$$
so that $u\in\mathscr{U}(S,\eta_0,\eta_1,m)$.
By the definition of $\overline{\mathrm{RCIF}}$, we have
$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m)\,h^T\Sigma h \le \|\Sigma h\|_c^2\,s^*, \qquad (2.46)$$
and
$$\overline{\mathrm{RCIF}}_{\mathrm{est},q}(S;\eta_0,\eta_1,m)\,\|h\|_q \le \|\Sigma h\|_c\,(s^*)^{1/q}. \qquad (2.47)$$
Moreover, we have
$$\|(\Sigma h)_{\bar S^c}\|_\infty \le \|X_{\bar S^c}^T(y - X\hat\beta)/n\|_\infty + \|X_{\bar S^c}^T(y - X\beta^o)/n\|_\infty \le (1+\eta)\lambda,$$
and
$$\|(\Sigma h)_{\bar S}\|_2 \le \|X_{\bar S}^T(y - X\hat\beta)/n\|_2 + \|X_{\bar S}^T(y - X\beta^o)/n\|_2 \le \lambda m^{1/2} + \eta_0\lambda_*m^{1/2} + \eta_1m^{1/2}\lambda_* \le (1+\eta)m^{1/2}\lambda.$$
Thus,
$$\|\Sigma h\|_c = \max\big\{\|(\Sigma h)_{\bar S^c}\|_\infty,\ \|(\Sigma h)_{\bar S}\|_2/m^{1/2}\big\} \le (1+\eta)\lambda. \qquad (2.48)$$
We establish the RCIF error bounds in (2.41) and (2.42) by inserting the above inequality into (2.46) and (2.47) respectively.
To compare the RCIF and the RE, we note that
$$(1-\eta)\|u\|_1 \le 2|S|^{1/2}\|u_S\|_2 + \eta_1m^{1/2}\|u_{\bar S}\|_2$$
for $u\in\mathscr{U}(S;\eta_0,\eta_1,m)$, so that for $\eta < 1$
$$\begin{aligned}
u^T\Sigma u &= u_{\bar S^c}^T(\Sigma u)_{\bar S^c} + u_{\bar S}^T(\Sigma u)_{\bar S}\\
&\le \|u_{\bar S^c}\|_1\|\Sigma u\|_c + m^{1/2}\|u_{\bar S}\|_2\|\Sigma u\|_c\\
&\le \Big(\frac{2|S|^{1/2}\|u_S\|_2 + \eta_1m^{1/2}\|u_{\bar S}\|_2}{1-\eta} + m^{1/2}\|u_{\bar S}\|_2\Big)\|\Sigma u\|_c\\
&\le \frac{2|S|^{1/2} + (1-\eta_0)m^{1/2}}{1-\eta}\,\|u\|_2\,\|\Sigma u\|_c.
\end{aligned}$$
It follows that
$$\frac{u^T\Sigma u}{\|u\|_2^2} \le \Big[\frac{2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}}{1-\eta}\Big]^2\frac{\|\Sigma u\|_c^2\,s^*}{u^T\Sigma u},$$
and
$$\frac{u^T\Sigma u}{\|u\|_2^2} \le \Big[\frac{2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}}{1-\eta}\Big]\frac{\|\Sigma u\|_c\,(s^*)^{1/2}}{\|u\|_2}.$$
Taking the infimum over the cone $\mathscr{U}(S;\eta_0,\eta_1,m)$ on both sides and noting that $\xi_1 = \big[2(|S|/s^*)^{1/2} + (1-\eta_0)(m/s^*)^{1/2}\big]/(1-\eta)$, we obtain
$$\overline{\mathrm{RCIF}}_{\mathrm{pred}}(S;\eta_0,\eta_1,m) \ge \overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)/\xi_1^2, \qquad \overline{\mathrm{RCIF}}_{\mathrm{est},2}(S;\eta_0,\eta_1,m) \ge \overline{\mathrm{RE}}_2^2(S;\eta_0,\eta_1,m)/\xi_1.$$
This completes the proof. $\Box$
2.4 Scaled concave PLSE
We have studied in the previous sections the properties of all the local solutions in $\mathscr{B}_0(\lambda_*,\kappa_*)$. Once the local solution set $\mathscr{B}_0(\lambda_*,\kappa_*)$ is obtained, one still needs to choose an appropriate solution in the set, or a proper penalty level. This problem, which will be studied in this section, is essentially to estimate the noise level $\sigma$, due to scale invariance.
Numerous efforts have been devoted to scale-free estimation under the $\ell_1$ penalty. Stadler et al. [67] proposed the minimizer of the penalized joint log-likelihood of the regression coefficients and noise level with an $\ell_1$ penalty. The comment on this paper by Antoniadis [2] pointed out that their estimator is equivalent to the joint minimization of Huber's concomitant loss,
$$(\hat\beta,\hat\sigma) = \arg\min_{\beta,\sigma}\ \frac{\|y - X\beta\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \lambda_0\|\beta\|_1. \qquad (2.49)$$
It turns out that (2.49) coincides with many other works on scale-free estimation under the $\ell_1$ penalty. For example, the square-root Lasso solution [5] and the equilibrium of the iterative algorithm [69] are both equivalent to (2.49). However, all of these studies of scale-free estimation are limited to the $\ell_1$ penalty. The scaled concave PLSE is not an easy extension of the scaled $\ell_1$ penalization due to the loss of the scale-free property.
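For the $\ell_1$ case, the equilibrium of the iterative algorithm of [69] coincides with (2.49). A minimal sketch of that iteration, using scikit-learn's Lasso as the inner solver purely for illustration (any Lasso solver at penalty level $\lambda_0\hat\sigma$ would do):

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso(X, y, lam0, n_iter=20, tol=1e-6):
    """Iterative scaled Lasso: alternate a Lasso fit at penalty level lam0 * sigma_hat
    with the noise-level update sigma_hat = ||y - X beta_hat||_2 / sqrt(n).
    The fixed point solves the concomitant-loss problem (2.49)."""
    n = X.shape[0]
    sigma = np.sqrt(np.mean(y ** 2))          # initial noise level from the null model
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        fit = Lasso(alpha=lam0 * sigma, fit_intercept=False).fit(X, y)
        beta = fit.coef_
        sigma_new = np.linalg.norm(y - X @ beta) / np.sqrt(n)
        if abs(sigma_new - sigma) < tol * sigma:
            sigma = sigma_new
            break
        sigma = sigma_new
    return beta, sigma
```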
In fact, the concomitant loss and the square-root formulations fail for concave penalties. To illustrate this, we take the MCP as an example. Denote by $\sigma^* = \|y - X\beta^*\|_2/n^{1/2}$ the oracle noise level estimator given the true coefficients $\beta^*$. Under the Gaussian assumption, this is the maximum likelihood estimator of $\sigma$ when $\beta^*$ is known and thus a natural estimation target. For the minimax concave penalty $\rho(t,\lambda) = \lambda\int_0^{|t|}(1 - x/(\lambda\gamma))_+dx$,
$$\hat\sigma^2 = \arg\min_{\sigma^2}\ \frac{\|y - X\beta^*\|_2^2}{2n\sigma} + \frac{\sigma}{2} + \frac{1}{\sigma}\sum_{j=1}^{p}\rho(|\beta^*_j|;\lambda_0\sigma) = \{\sigma^*\}^2 - (1/\gamma)\sum_{j:\,|\beta^*_j| < \lambda_0\hat\sigma\gamma}\{\beta^*_j\}^2,$$
so $\hat\sigma$ is expected to underestimate $\sigma^*$ unless there is no small $\beta^*_j$ with $|\beta^*_j| < \lambda_0\hat\sigma\gamma$. This validates the argument that the concomitant loss formulation fails for concave penalties. In addition, the iterative algorithm becomes extremely difficult to analyze due to the loss of its equivalence to joint convex minimization, compared with the Lasso.
2.4.1 Description of the scaled concave PLSE
Given a coefficient vector β̂, define the noise level estimator as

  σ̂(β̂) = ‖y − Xβ̂‖₂ / (n − d)^{1/2},   (2.50)

where d is a parameter that provides an option to adjust the degrees of freedom. Typically, we let d = p when p < n and d = 0 otherwise. Within the local solution set B_0(λ_*, κ_*), we search for a subclass of scaled concave penalized least-square estimators B_{0,scal}(λ₀; λ_*, κ_*), defined as

  B_{0,scal}(λ₀; λ_*, κ_*) = { β̂ ∈ B_0(λ_*, κ_*) : λ₀σ̂(β̂) = λ }.   (2.51)

Here, λ₀ is a prefixed penalty level that does not depend on σ. For example, one may choose λ₀ = A{(2/n) log p}^{1/2} for the universal penalty and λ₀ = A n^{−1/2} L₁(k/p) for the smaller penalty, with appropriate k and A. We derive the consistency results for noise level estimation for the different λ₀ separately in the following analysis.
As discussed in Section 2.2, B_0(λ_*, κ_*) is a large class of estimators that includes all local solutions connected to the origin, regardless of the specific algorithm used to compute the solution. We use the PLUS algorithm here as an example to illustrate the computation of the estimators in B_{0,scal}(λ₀; λ_*, κ_*). The PLUS path, indexed by x, is defined as

  λ(x) ⊕ β̂(x) ≡ a continuous path of solutions of (2.12) in R^{1+p},
                with β̂(0) = 0 and lim_{x→∞} λ(x) = 0.   (2.52)

Given a PLUS solution path, the scaled estimator can be defined as

  β̂^{scal} = β̂(x̂),   x̂ = min{ x : λ₀σ̂(β̂(x)) ≥ λ(x) }.   (2.53)

The "≥" in the definition of x̂ in (2.53) can be changed to "=" by the continuity of the PLUS path. Under mild regularity conditions, we will prove that β̂^{scal} ∈ B_{0,scal}(λ₀; λ_*, κ_*); see the next subsections for the proof. This also guarantees the non-emptiness of B_{0,scal}(λ₀; λ_*, κ_*).
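The following R sketch illustrates the stopping rule in (2.53), assuming a solution path has already been computed (for instance by the plus package used later in the simulations); the `path` list structure, with components `lambda` and `beta`, is a hypothetical placeholder rather than the actual output format of any package.

```r
# Sketch of the path-based stopping rule (2.53): scan a precomputed concave
# PLSE solution path and stop at the first knot where lambda0 * sigma_hat
# reaches the penalty level of that knot.
sigma_hat <- function(y, X, beta, d = 0) {
  sqrt(sum((y - X %*% beta)^2) / (length(y) - d))   # estimator (2.50)
}

scaled_concave_plse <- function(path, y, X, lambda0, d = 0) {
  for (k in seq_along(path$lambda)) {
    s <- sigma_hat(y, X, path$beta[, k], d)
    if (lambda0 * s >= path$lambda[k])              # stopping rule in (2.53)
      return(list(beta = path$beta[, k], sigma = s, lambda = path$lambda[k]))
  }
  warning("stopping rule never triggered along the supplied path")
  k <- length(path$lambda)
  list(beta = path$beta[, k], sigma = sigma_hat(y, X, path$beta[, k], d),
       lambda = path$lambda[k])
}
```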
2.4.2 Performance guarantees of scaled concave PLSE at universal
penalty levels
In this subsection, we derive the consistency results for noise level estimation with sufficiently large λ₀. Since σ* = ‖y − Xβ*‖₂/n^{1/2} is a natural target of noise level estimation, we aim to derive convergence results for σ̂(β̂)/σ* with β̂ ∈ B_{0,scal}(λ₀; λ_*, κ_*) in the following theorem.
Theorem 2.3. Let β* be the true regression coefficients, β̂^{scal} be as in (2.53) and σ* = ‖y − Xβ*‖₂/n^{1/2} be the oracle noise level estimator. Let 0 < η < 1 and ξ = (1+η)/(1−η). Suppose κ(ρ) ≤ κ_* and RE₂²(S; η, 1) ≥ κ_*.
(i) Let τ₀ = (1+ξ)(1+η)λ₀ s^{1/2} / RE₁(S; η, 1). When (2.16) holds with λ_* = λ₀σ*(1−τ₀) and δ_* = 1, we have β̂^{scal} ∈ B_{0,scal}(λ₀; λ_*, κ_*). Moreover, for any β̂ ∈ B_{0,scal}(λ₀; λ_*, κ_*),

  max( 1 − σ̂(β̂)/σ*, 1 − σ*/σ̂(β̂) ) ≤ τ₀,   ‖Xβ̂ − Xβ*‖₂ / (n^{1/2}σ*) ≤ τ₀/(1−τ₀).   (2.54)

In particular, if we take λ₀ = A{(2/n) log p}^{1/2} with A > 1/η and τ₀ → 0, then for all ε > 0,

  P_{β*,σ}( |σ̂(β̂)/σ − 1| > ε ) → 0.   (2.55)

(ii) Let τ_*² = η(1+η)(1+ξ)²λ₀²s / RE₁(S; η, 1). When (2.16) holds with λ_* = λ₀σ*(1−τ_*²) and δ_* = 1, we have β̂^{scal} ∈ B_{0,scal}(λ₀; λ_*, κ_*). Moreover, for any β̂ ∈ B_{0,scal}(λ₀; λ_*, κ_*),

  max( 1 − σ̂(β̂)/σ*, 1 − σ*/σ̂(β̂) ) ≤ 3τ_*².   (2.56)

If we take λ₀ = A{(2/n) log p}^{1/2} with A > 1/η and τ_*² ≪ n^{−1/2}, then

  n^{1/2}( σ̂(β̂)/σ − 1 ) → N(0, 1/2)   (2.57)

in distribution under P_{β*,σ}.
By proving β̂^{scal} ∈ B_{0,scal}(λ₀; λ_*, κ_*), Theorem 2.3 guarantees the non-emptiness of B_{0,scal}(λ₀; λ_*, κ_*) with appropriate λ_* and λ₀. Moreover, it provides convergence and asymptotic normality for the scaled concave estimation of the noise level under only the restricted eigenvalue conditions. In part (i), we achieve an error rate τ₀ ≍ {(s/n) log p}^{1/2} for noise level estimation with β̂ ∈ B_{0,scal}(λ₀; λ_*, κ_*). This matches the ℓ1 penalized maximum likelihood estimator in [67]. In part (ii), we provide a sharper convergence rate and asymptotic normality results. The sharper rate τ_*² is of the order (s/n) log p, essentially the square of the rate in part (i). The asymptotic normality then follows from the sharper rate under mild assumptions. The convergence rate in part (ii) matches the rate of the iterative algorithm formulation in Sun and Zhang [69].
Proof of Theorem 2.3. First prove (i). Denote z “ XT py ´ Xβ˚qn and hpxq “
pβpxq´ β˚. Consider penalty level λpx0q “ λ˚ “ λ0σ
˚p1 ´ τ0q for certain x0 in the PLUS
path. Since λpx0q “ λ˚ satisfies (2.16), it follows from Theorem 2.1 and the definition of τ0
that Xhpx0q2n12 ď σ˚τ0p1´ τ0q ď σ˚τ0. Then we have
λ0pσppβpx0qq “ λ0y ´Xpβ
px0q2n
12
ě λ0
ˇ
ˇ
ˇσ˚ ´ Xhpx0q2n
12ˇ
ˇ
ˇě λ0σ
˚p1´ τ0q “ λpx0q. (2.58)
By the definition of px, px ď x0. Since any penalty level λpxq ě λ˚ is a local solution
of (2.12) in the PLUS path, λpxq is a non-increasing function of x for λpxq ě λ˚. Thus,
λppxq ě λpx0q “ λ˚. It follows that pβscal
P B0,scalpλ0;λ˚, κ˚q.
Moreover, for any pβ P B0,scalpλ0;λ˚, κ˚q, with penalty λ ě λ˚, we have
pσppβq “ λλ0 ě λ˚λ0 “ σ˚p1´ τ0q. (2.59)
Furthermore, by Theorem 2.1 we have
ˇ
ˇ
ˇy ´Xpβ2n
12 ´ σ˚ˇ
ˇ
ˇď Xppβ ´ β˚q2n
12 ď τ0pσppβq. (2.60)
Thus,
pσppβq
σ˚“y ´Xpβ2
n12σ˚ďτ0pσppβq ` σ
˚
σ˚“ 1` τ0
pσppβq
σ˚, (2.61)
This implies pσppβq ď σ˚p1 ´ τ0q. Combing with (2.59), the first part of (2.54) holds. In
addition,
Xpβ ´Xβ˚2n12 ď τ0pσppβq ď σ˚τ0p1´ τ0q. (2.62)
The second part of (2.54) holds. To prove (3.2), since for certain A,
Pβ,σ”
z8 ď Aσtp2nq log pu12ı
Ñ 1,
we have (3.2) follows from (2.54).
Now we prove (ii). By the KKT condition,
´pz8 ` λqh1 ď pXhqT!
y ´Xβ˚ ` y ´Xpβ)
n
ď pσ˚q2 ´ y ´Xpβ22n
ď pXhqT t2py ´Xβ˚q ´Xhun ď 2z8h1. (2.63)
We use above inequalities as lower and upper bounds for pσ˚q2 ´ y ´Xpβ22n.
Consider λpx1q “ λ˚ “ λ0σ˚p1´ τ2
˚q in the PLUS path. Since λpx1q “ λ˚ satisfies (2.16),
it follows from Theorem 2.1 that hpx1q1 ď p1` ξq2p1` ηqλpx1qsRE1pS; η, 1q. Combining
with z8 ă λ0ησ˚p1´ τ2
˚q, we have
λ20 pσ2ppβ
px1qq “ λ2
0 y ´Xpβpx1q22n
ě λ20
´
tσ˚u2 ´ 2z8hpx1q1
¯
ě λ20tσ
˚u2`
1´ 2τ2˚p1´ τ
2˚q
2˘
ě λ20tσ
˚u2p1´ τ2˚q
2 “ pλpx1qq2. (2.64)
The last inequality holds since τ2˚ ď 1. As in part (i), we find λppxq ě λpx1q “ λ˚ and
pβscal
P B0,scalpλ0;λ˚, κ˚q.
Similarly, for any pβ P B0,scalpλ0;λ˚, κ˚q with penalty λ ě λ˚ “ λ0σ˚p1 ´ τ2
˚q, we have
pσppβq “ λλ0 ě λ˚λ0 “ σ˚p1 ´ τ2˚q. On the other hand, recall that z8 ă λ0ησ
˚p1 ´ τ2˚q
and pβ ´ β˚1 ď p1` ξq2p1` ηqλ0pσppβqsRE1pS; η, 1q, we have
pσppβq2
tσ˚u2“
y ´Xpβ22ntσ˚u2
ď
tσ˚u2 `´
z8 ` λ0pσppβq¯
pβ ´ β˚1
tσ˚u2
ďtσ˚u2 ` τ2
˚p1´ τ2˚qpσp
pβqσ˚ ` p1ηqτ2˚pσ
2ppβq
tσ˚u2. (2.65)
Solving above equation w.r.t pσppβqσ˚, we obtain pσppβqσ˚ ď p1 ` τ2˚qp1 ´ τ2
˚q. Thus 1 ´
σ˚pσppβq ď 3τ2˚ . This proves (3.3). Given (3.3), the proof of (2.57) follows the proof of
Theorem 2 (ii) in Sun and Zhang [69]. ˝
2.4.3 Performance bounds of scaled concave PLSE at smaller penalty
levels
In this subsection, we derive the consistency results for noise level estimation with smaller
λ0.
Theorem 2.4. Let β˚, pβscal
and σ˚ be as in Theorem 2.3 and ĎRE˚,2pS; ¨, ¨q be as in (2.39).
Let m be a positive integer, η “ η0`η1 with positive η0, η1 and ξ2 “ r2`p1´η0qpmsq12sp1´
ηq. Define τ1 “ p1`ηqλ0ξ2ps_mq12ĎRE˚,2pS; η,mq. Suppose κpρq ď κ˚ and RE2
2pS; η, 1q ě
κ˚. When (2.32) holds with λ˚ “ λ0σ˚p1´τ1q, we have pβ
scalP B0,scalpλ0;λ˚, κ˚q. Moreover,
for any pβ P B0,scalpλ0;λ˚, κ˚q,
max
˜
1´pσppβq
σ˚, 1´
σ˚
pσppβq
¸
ď τ1,Xpβ ´Xβ˚2
n12σ˚ď
τ1
1´ τ1. (2.66)
If we take λ0 “ An´12L1pkpq with k in (2.37), A ą 1η0 and τ1 Ñ 0, then for all ε ą 0
Pβ˚,σp|pσppβqσ ´ 1| ą εq Ñ 0. (2.67)
Similarly to Theorem 2.3, Theorem 2.4 first guarantees the non-emptiness of B_{0,scal}(λ₀; λ_*, κ_*), but with a smaller λ₀. Furthermore, it provides convergence results for noise level estimation at smaller penalties under nearly identical conditions as in Theorem 2.3. Compared with the existing literature, Theorem 2.4 can be viewed as a generalization of the scaled Lasso with smaller penalties in Sun and Zhang [70].
Proof of Theorem 2.4. Consider penalty level λpx1q “ λ˚ “ λ0σ˚p1 ´ τ1q for certain
x1 ă 8 in the PLUS path. Since (2.32) holds for λpx1q, by Theorem 2.2, we have
Xhpx1q2n12 ď
p1` ηqξ1λps˚q12
ĎRE2pS; η0, η1,mqďp1` ηqξ2λ
px1qps_mq12
ĎRE˚,2pS; η,mqď σ˚τ1.
Similar as (2.58),
λ0pσppβpx1qq “ λ0y ´Xpβ
px1q2n
12 ě λ0σ˚p1´ τ1q “ λpx1q.
As in the proof of Theorem 2.3, we find λppxq ě λpx1q and pβscal
P B0,scalpλ0;λ˚, κ˚q.
Moreover, (2.66) and (2.67) can be proved in the same way as Theorem 2.3. ˝
2.5 Simulation Study
In this section, we report the noise-level estimation results of the scaled concave PLSE and compare it with several competing methods in a comprehensive simulation study. The experimental settings follow Reid et al. [61] and are described in our notation below.
The simulations aim to estimate the noise level σ in a variety of settings. All simulations are run with a sample size of n = 100, and the number of predictors takes four different values: p = 100, 200, 500, 1000. Elements of the design matrix X are generated randomly as X_ij ~ N(0, 1), and the correlation between columns of X is set to ρ. The true parameter β* is generated as follows: the number of nonzero elements is set to p_nz = ⌈n^α⌉, so that α controls the sparsity of β* (the higher the α, the less sparse β*); it ranges between 0 and 1. The indices corresponding to nonzero β*_j are selected at random, and their values are drawn from a Laplace(0, 1) distribution. The resulting β* is then scaled so that the signal-to-noise ratio, defined as (β*)ᵀΣβ*/σ², equals a predetermined value snr. Simulations were run over a grid of values for each of the parameters described above; an R sketch of this data-generating process is given after the list below. In particular,
• ρ “ 0, 0.2, 0.4, 0.6, 0.8
• α “ 0.1, 0.3, 0.5, 0.7, 0.9
• snr “ 0.5, 1, 2, 5, 10, 20.
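The following R sketch illustrates this data-generating process under the settings above. The equicorrelated covariance structure is an assumption made here for concreteness, since the exact correlation pattern of Reid et al. [61] is not restated in this section.

```r
# Sketch of the simulation design described above (equicorrelation assumed).
gen_data <- function(n = 100, p = 500, rho = 0.2, alpha = 0.5, snr = 1, sigma = 1) {
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1            # correlated columns
  X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)         # rows ~ N(0, Sigma)
  p_nz <- ceiling(n^alpha)                                # number of nonzeros
  beta <- numeric(p)
  idx <- sample(p, p_nz)
  beta[idx] <- rexp(p_nz) * sample(c(-1, 1), p_nz, replace = TRUE)  # Laplace(0,1)
  beta <- beta * sqrt(snr * sigma^2 / drop(t(beta) %*% Sigma %*% beta))  # match snr
  y <- drop(X %*% beta) + sigma * rnorm(n)
  list(X = X, y = y, beta = beta)
}
set.seed(1)
dat <- gen_data()
```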
We simulate B = 200 independent datasets for each set of parameters. The competing methods considered include:

• The oracle estimator given the true coefficients β*,
  σ̂²_o = (1/n) Σ_{i=1}^n (y_i − X_iᵀβ*)².

• The cross-validation based Lasso, denoted CV L,
  σ̂²_{CVL} = 1/(n − ŝ_{λ̂_CVL}) Σ_{i=1}^n (y_i − X_iᵀβ̂_{λ̂_CVL})²,
  where λ̂_CVL is selected by 10-fold cross-validation and ŝ_{λ̂_CVL} = ‖β̂_{λ̂_CVL}‖₀.

• The cross-validation based SCAD, denoted CV SCAD,
  σ̂²_{SCAD} = 1/(n − ŝ_{λ̂_SCAD}) Σ_{i=1}^n (y_i − X_iᵀβ̂_{λ̂_SCAD})²,
  where λ̂_SCAD is selected by 10-fold cross-validation and ŝ_{λ̂_SCAD} = ‖β̂_{λ̂_SCAD}‖₀.

• The scaled Lasso of Sun and Zhang [69], with λ₀ = {(2/n) log p}^{1/2}, denoted SZ L.

• The scaled Lasso with the smaller penalty level λ₀ = (2/n)^{1/2} L₁(k/p), with k in (2.37), denoted SZ L2.

• The scaled MCP in (2.53) with the universal penalty λ₀ = {(2/n) log p}^{1/2}, denoted SZ MCP.

• The scaled MCP in (2.53) with the smaller penalty λ₀ = (2/n)^{1/2} L₁(k/p), with k in (2.37), denoted SZ MCP2.

• The scaled MCP in (2.53) with an adaptive penalty, denoted SZ MCP3, defined as follows. We generate an error vector ε₁ ~ N(0, I_n) and compute z = Xᵀε₁/n. We then order |z| and pick the k₁ = ⌈k⌉ largest elements of |z|, denoted |z|_{(1)}, ..., |z|_{(k₁)}, with k in (2.37). Then, let λ₁ = n^{−1/2} L₁(k₁/p). Finally, define
  Λ₀ = λ₁ + { (1/k₁) Σ_{j=1}^{k₁} (|z|_{(j)} − λ₁)₊² }^{1/2}.
We repeat this procedure 500 times and take λ₀ equal to the median of all the computed Λ₀'s; an R sketch of this step is given below.
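A minimal R sketch of the adaptive penalty level for SZ MCP3 follows. The helper L1fun is assumed to implement the function L₁(·) appearing in (2.37); it is passed in as an argument because its definition is given earlier in the chapter and is not repeated here.

```r
# Sketch of the adaptive penalty level lambda0 used by SZ_MCP3.
adaptive_lambda0 <- function(X, k, L1fun, B = 500) {
  n <- nrow(X); p <- ncol(X)
  k1 <- ceiling(k)
  Lam0 <- replicate(B, {
    eps1 <- rnorm(n)                              # error vector eps1 ~ N(0, I_n)
    z <- drop(crossprod(X, eps1)) / n             # z = X' eps1 / n
    zs <- sort(abs(z), decreasing = TRUE)[1:k1]   # k1 largest |z_j|
    lambda1 <- L1fun(k1 / p) / sqrt(n)
    lambda1 + sqrt(mean(pmax(zs - lambda1, 0)^2))
  })
  median(Lam0)                                    # lambda0 = median of the Lam0's
}
```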
The true noise level (standard deviation) is set to σ = 1 in all simulations. For the concave penalties, we use the default concavity parameter γ of the corresponding R packages. Specifically, we use the R package glmnet for the Lasso estimates, the package ncvreg for the SCAD estimates, and the R package plus for the MCP estimates.
2.5.1 No signal case: β˚ “ 0
We first consider the case where α = −∞, forcing β* = 0. The snr is irrelevant here, since there is no signal.
Methods     p=100      p=200      p=500      p=1000
Oracle      0.0034    -0.0059     0.0019    -0.0064
CV L       -0.0256    -0.0215    -0.0340    -0.0510
CV SCAD    -0.0163    -0.0187    -0.0304    -0.0481
SZ L        0.0004    -0.0141    -0.0093    -0.0078
SZ L2      -0.0430    -0.0443    -0.0509    -0.0519
SZ MCP     -0.0018    -0.0181    -0.0098    -0.0103
SZ MCP2    -0.0669    -0.0638    -0.0913    -0.0816
SZ MCP3    -0.0344    -0.0513    -0.0913    -0.112

Table 2.1: Median bias of standard deviation estimates. No signal, σ = 1, ρ = 0, sample size n = 100. The minimum error besides the oracle is in bold for each analysis.
Table 2.1 shows the median bias of the different standard deviation estimates when no signal
exists, with σ = 1, ρ = 0, and n = 100. Besides the oracle estimator, it is clear that the scaled
estimators with universal penalty SZ L and SZ MCP perform best in each analysis. All
other estimators are slightly downward biased. One possible reason is that the estimators
with larger penalty, e.g., SZ L or SZ MCP , tend to select fewer variables. When there
is no signal, choosing fewer variables may lead to better noise level estimation accuracy.
Comparatively, the scaled estimators with smaller penalty or the cross-validation based
estimators tend to include more variables. This may lead to the underestimation of noise
level.
2.5.2 Effect of correlation: ranging over different ρ
Now we consider the setting where the correlation ρ between columns of the design matrix ranges from 0 to 0.8. Figure 2.1 plots the median standard deviation estimates over different values of ρ, with fixed α = 0.5 and snr = 1.
[Figure 2.1: Median standard deviation estimates over different levels of predictor correlation. σ = 1, α = 0.5, snr = 1, sample size n = 100, predictors p = 100, 200, 500, 1000 moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).]
One clear trend is that correlation between predictors comes to the rescue of many variance estimators. This observation agrees with the results in Reid et al. [61]. Among all estimators, the scaled MCP with adaptive penalty SZ MCP3 performs consistently well, even better than the cross-validation based methods over a large range of correlations. This is because SZ MCP3 is data dependent and accounts for the correlation between predictors. Comparatively, when the correlation ρ grows large, even the scaled MCP with smaller penalty SZ MCP2 chooses too large a penalty, which then degrades its performance.
2.5.3 Effect of signal-to-noise ratio: ranging over different snr
We then consider the setting where the signal-to-noise ratio ranges from 0.5 to 20. Figure 2.2 plots the median standard deviation estimates over different levels of snr, with fixed α = 0.5 and ρ = 0. As in the previous subsection, the performance of SZ MCP3 remains in the top tier, together with CV L and CV SCAD, over a large range of snr.
[Figure 2.2: Median standard deviation estimates over different levels of signal-to-noise ratio. σ = 1, α = 0.5, ρ = 0, sample size n = 100, predictors p = 100, 200, 500, 1000 moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).]
Besides, the scaled MCP with smaller penalty (SZ MCP2) also achieves very competitive estimation accuracy. On the contrary, the performance of the two scaled Lasso estimators (SZ L and SZ L2) degrades significantly as the snr increases. We believe that the reason lies in the intrinsic bias of the Lasso. Indeed, with a fixed sparsity level, the higher the snr, the larger the per-element signal strength. The impact of the bias of the Lasso becomes increasingly severe as the individual signal size grows.
2.5.4 Effect of sparsity: ranging over different α
We finally consider the setting where the sparsity level α ranges from 0.1 to 0.9. Figure 2.3 plots the median standard deviation estimates over different values of α, with fixed ρ = 0 and snr = 1. It is clear that each of the estimators shows an upward bias trend.
[Figure 2.3: Median standard deviation estimates over different levels of sparsity. σ = 1, snr = 1, ρ = 0, sample size n = 100, predictors p = 100, 200, 500, 1000 moving from left to right along rows. Plot numbers refer to CV L (1), CV SCAD (2), SZ L (3), SZ L2 (4), SZ MCP (5), SZ MCP2 (6), SZ MCP3 (7).]
This trend appears because no estimator successfully selects all important variables when more and more variables come into the model. Nevertheless, SZ MCP2 and SZ MCP3, along with CV L and CV SCAD, are the four estimators that perform stably over a large range of sparsities. Indeed, while the superiority of cross-validation based methods has been demonstrated by Reid et al. [61], SZ MCP2 and SZ MCP3 also proved robust toward different sparsities in our experiments.
Overall, we conclude that the scaled MCP with adaptive penalty SZ MCP3 performs consistently well in most settings, regardless of the sparsity, the signal-to-noise ratio or the design correlation. The performance of the scaled MCP with smaller penalty SZ MCP2 is also notable, except when the design matrix is highly correlated. Comparatively, SZ MCP is not very competitive in several settings. This confirms again that a universal penalty for scaled estimators may be too large. More intuition behind the choice of λ₀ is given in Section 2.6.
2.6 Discussion
In this chapter, we developed a new theory of the concave penalized least-square estimator and its scaled version under much weaker conditions. We proved that the concave PLSE matches the oracle prediction and coefficient estimation properties of the Lasso based only on RE-type conditions, among the mildest conditions on the design matrix. Moreover, to achieve selection consistency, our theorems do not require any additional conditions for proper concave penalties such as the SCAD penalty and the MCP. Furthermore, the scaled version of the concave PLSE provides consistency and asymptotic normality for noise level estimation. A comprehensive simulation study of variance estimation under different levels of sparsity, signal-to-noise ratio and design correlation demonstrated the superior performance of the scaled concave PLSE.
[Figure 2.4: Five λ₀'s as functions of k, n = 100, p = 1000. Line numbers refer to (1) λ₀(k) = {(2/n) log p}^{1/2}, (2) λ₀(k) = {(2/n) log(p/k)}^{1/2}, (3) λ₀(k) = (2/n)^{1/2} L₁(k/p), (4) the adaptive λ₀ described in Section 2.5 with various k, assuming that the correlation between columns of X is 0, and (5) the same as (4) except that the correlation between columns of X is 0.8. Here k₁ is the solution to (2.37) and k₂ is the solution to 2k = L₁⁴(k/p) + 2L₁²(k/p).]
In the simulation study, we considered three different λ₀'s to compute the scaled MCP. All three λ₀'s are independent of the true sparsity level s = ‖β*‖₀. On the other hand, the theoretical choices of λ₀ for the universal and smaller penalties are respectively {(2/n) log(p/s)}^{1/2} and (2/n)^{1/2} L₁(s/p). One in fact needs to find a replacement, denoted k, for s. We plot the different λ₀'s as functions of k in Figure 2.4 to gain more intuition about the choice of λ₀.
It is clear from Figure 2.4 that the constant λ₀ = {(2/n) log p}^{1/2} (line 1) over-estimates λ₀ = {(2/n) log(p/k)}^{1/2} (line 2) by a significant amount. This partially explains the impaired performance of SZ MCP and SZ L in the simulations. When the columns of the design matrix are uncorrelated, the adaptive λ₀ (line 4) is quite close to the theoretical choices of the universal and smaller λ₀ (lines 2 and 3) over a large range of k. On the other hand, when the columns of the design matrix are highly correlated, the adaptive λ₀ (line 5) is clearly well below lines 2 and 4. This explains the superior performance of SZ MCP3 compared to SZ MCP2 for highly correlated designs. Moreover, Sun and Zhang [70] proposed to estimate k by solving C₀k = L₁⁴(k/p) + 2L₁²(k/p) with C₀ a certain constant; a sketch of this root-finding step is given below. We take C₀ = 1 in the simulations, following the suggestion of Sun and Zhang [70]. Different values of C₀ (e.g. C₀ = 2, corresponding to k₂ in Figure 2.4) were also tried, with no dramatic change in the accuracy of noise level estimation.
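A minimal R sketch of this root-finding step follows, assuming a helper L1fun that implements the function L₁(·) from (2.37); the bracketing interval is a heuristic choice and may need adjustment.

```r
# Sketch of estimating k by solving C0 * k = L1(k/p)^4 + 2 * L1(k/p)^2,
# as suggested by Sun and Zhang [70].  L1fun is assumed to implement L1(.).
solve_k <- function(p, L1fun, C0 = 1) {
  f <- function(k) C0 * k - L1fun(k / p)^4 - 2 * L1fun(k / p)^2
  uniroot(f, lower = 1, upper = p / 2)$root   # heuristic bracket [1, p/2]
}
```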
Chapter 3
Penalized least-square estimation with noisy and missing
data
3.1 Introduction
In this chapter, we consider the high-dimensional linear model where the design matrix is subject to disturbance. Two types of disturbance will be discussed in this chapter:
(i) Covariates with noise: we observe Z_ij = X_ij + W_ij, where W_i = (W_i1, ..., W_ip)ᵀ is a random vector with mean 0 and known covariance matrix Σ_w.
(ii) Covariates with missingness: let η_ij = 1 if X_ij is observed and η_ij = 0 otherwise. Define Z_ij = X_ij η_ij; then the random matrix Z ∈ R^{n×p} is observed. Denote the probability that X_ij is missing by π, so that η_ij follows a Bernoulli(1 − π) distribution, independently for i = 1, ..., n and j = 1, ..., p.
We assume the target coefficient vector is s-sparse (i.e., ‖β*‖₀ ≤ s) and allow the number of coefficients p to be much larger than the sample size n. When the design matrices are fully observed, i.e., π = 0 and Σ_w = 0, the problem has been addressed using penalized least-square estimation (PLSE) in Chapter 2. The PLSE for fully observed designs, including the Lasso [71], SCAD [21] and MCP [84], can all be written in the form

  β̂ ∈ argmin_β { (1/2)βᵀΣβ − βᵀz + Σ_{j=1}^p ρ(|β_j|; λ) }   (3.1)

with Σ = XᵀX/n, z = Xᵀy/n, and ρ(·; λ) being the penalty function indexed by the penalty level λ.
When the designs are subject to disturbance, the program (3.1) cannot be implemented because Σ and z are unobtainable. Observing that Σ and z serve as natural replacements of their unobserved population counterparts Σ_x = E XᵀX/n and Σ_x β*, Loh and Wainwright [44] proposed to form other estimates, say Γ̂ and γ̂, of Σ_x and Σ_x β*. The program (3.1) then becomes

  β̂ ∈ argmin_β { (1/2)βᵀΓ̂β − βᵀγ̂ + Σ_{j=1}^p ρ(|β_j|; λ) }.   (3.2)

With different types of disturbance, (Γ̂, γ̂) may take different forms. For example, when the covariates are subject to noise, (Γ̂, γ̂) can be taken as

  (Γ̂, γ̂) = (Γ̂_noi, γ̂_noi) = ( ZᵀZ/n − Σ_w, Zᵀy/n ).   (3.3)

When the covariates are subject to missingness, (Γ̂, γ̂) can be taken as

  (Γ̂, γ̂) = (Γ̂_mis, γ̂_mis) = ( Z̃ᵀZ̃/n − π diag(Z̃ᵀZ̃/n), Z̃ᵀy/n ),   (3.4)

where Z̃_ij = Z_ij/(1 − π). It is easy to show that both (Γ̂_noi, γ̂_noi) and (Γ̂_mis, γ̂_mis) are unbiased estimates of (Σ_x, Σ_x β*). Moreover, they reduce to (Σ, z) when the noise covariance matrix Σ_w = 0 or the missing probability π = 0.
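A minimal R sketch of the two surrogate pairs (3.3) and (3.4) follows; it only assumes that the noise covariance Σ_w (for the additive-noise case) or the missing probability π (for the missing-data case) is known, as stated above.

```r
# Surrogate pairs (Gamma_hat, gamma_hat) from (3.3) and (3.4).
surrogate_noisy <- function(Z, y, Sigma_w) {
  n <- nrow(Z)
  list(Gamma = crossprod(Z) / n - Sigma_w,       # (3.3): Z'Z/n - Sigma_w
       gamma = drop(crossprod(Z, y)) / n)        #        Z'y/n
}

surrogate_missing <- function(Z, y, pi_miss) {
  n <- nrow(Z)
  Zt <- Z / (1 - pi_miss)                        # rescaled design Z~ = Z/(1 - pi)
  G <- crossprod(Zt) / n
  diag(G) <- diag(G) * (1 - pi_miss)             # (3.4): subtract pi * diag(Z~'Z~/n)
  list(Gamma = G, gamma = drop(crossprod(Zt, y)) / n)
}
```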
One major issue with the optimization program (3.2) is its non-convexity. Indeed, even for the convex ℓ1 penalty, (3.2) may still be a non-convex optimization problem with multiple local solutions. This is because neither Γ̂_noi nor Γ̂_mis is positive semi-definite. Substantial effort has been made to overcome this technical issue. Loh and Wainwright [44] inserted a side constraint into the optimization problem and proposed the estimator

  β̂ ∈ argmin_{‖β‖₁ ≤ b₀√s} { (1/2)βᵀΓ̂β − βᵀγ̂ + Σ_{j=1}^p λ|β_j| },   (3.5)

with b₀ a constant. Loh and Wainwright proved that both statistical and optimization error bounds for coefficient estimation can be guaranteed with a properly chosen b₀ and some restricted eigenvalue conditions. Indeed, a properly chosen b₀ is critical for their results. The b₀ cannot be too small, since they require b₀ ≥ ‖β‖₂, where β is the unknown true coefficient vector. On the other hand, b₀ cannot be too large, due to the imposed restricted eigenvalue conditions. Datta and Zou [14] further proposed to approximate (3.2) by a convex objective function with the Lasso penalty, named CoCoLasso. The CoCoLasso first finds a nearest positive semi-definite matrix to the (possibly indefinite) matrix Γ̂,

  Γ̂₊ = argmin_{K ⪰ 0} ‖K − Γ̂‖_∞,   (3.6)

and then estimates β̂ with

  β̂ ∈ argmin_β { (1/2)βᵀΓ̂₊β − βᵀγ̂ + Σ_{j=1}^p λ|β_j| }.   (3.7)

Besides the penalized least-square type estimators, we also note that Dantzig-selector [11] type estimators have been proposed to deal with the disturbance-in-design problem. Rosenbaum et al. [64, 65] proposed the matrix-uncertainty (MU) selector and its improved version and proved their coefficient estimation properties. The MU selector, written in our notation, is

  β̂ ∈ { β′ ∈ R^p : ‖γ̂ − Γ̂β′‖_∞ ≤ µ‖β′‖₁ + τ },   (3.8)

where µ ≥ 0 and τ ≥ 0 are pre-specified constants. Note that the feasible set of the MU selector is formed by using the ℓ1 norm of the regression coefficients to bound ‖γ̂ − Γ̂β‖_∞. We believe that this may not be optimal, as an ℓ2-norm bound may be sufficient; detailed reasoning is given in Section 3.3.
In this chapter, we study the penalized least-square estimator (PLSE) (3.2) with general concave penalties, including the Lasso, SCAD and MCP as special cases. We prove that the PLSE subject to noise or missingness achieves the same scale of coefficient estimation error as for the fully observed design, based only on the restricted eigenvalue condition. Compared to Loh and Wainwright [45], we require no further side constraints or knowledge of the true coefficients. Compared to Datta and Zou [14], we analyze an exact solution of (3.2) instead of an approximation. More importantly, our approach is not limited to the ℓ1-penalized Lasso and applies to general concave penalties. Furthermore, we prove that a linear combination of the ℓ2 norm of the coefficients and the noise level is sufficient as a penalty level when noise or missingness exists. This sharpens existing results, which use the ℓ1 norm of the regression coefficients for the penalty level. Based on this, we extend the scaled PLSE to noisy and missing designs. Since cross-validation based techniques may be misleading for missing or noisy data, the proposed scaled solution is of great use. All our consistency results apply to the case where the number of predictors p is much larger than the sample size n.
The rest of this chapter is organized as follows: In Section 3.2, we present the coefficients
estimation bounds of the PLSE for missing/noisy designs along with some definitions
and assumptions on the regularizers and design matrices. In Section 3.3, we discuss
the theoretical choice of penalty level. Section 3.4 extends the scaled PLSE and proves
consistency of noise level estimation. Section 3.5 contains discussion.
3.2 Theoretical Analysis of PLSE
In this section, we derive the coefficients estimation error bounds of the concave PLSE for
missing and noisy design. The class of penalty functions we studied in this chapter follows
that in Section 2.2.
3.2.1 Restricted eigenvalue conditions
In high-dimensional regression with fully observed design, the restricted eigenvalue (RE; 6) condition can be viewed as the weakest available condition on the design matrix to guarantee the desired statistical properties. When the designs are subject to disturbance, the same type of RE condition can also be applied. For a given (not necessarily positive semi-definite) matrix Γ and η < 1, the restricted eigenvalue (RE) can be defined as

  RE₂²(Γ, S; η) = inf{ uᵀΓu / ‖u‖₂² : (1−η)‖u_{Sᶜ}‖₁ ≤ (1+η)‖u_S‖₁ }.   (3.9)

The RE condition refers to the property that RE₂(Γ, S; η) is bounded away from zero. Similarly, we define the compatibility coefficient, which can be viewed as the ℓ1-RE [77],

  RE₁²(Γ, S; η) = inf{ uᵀΓu |S| / ‖u_S‖₁² : (1−η)‖u_{Sᶜ}‖₁ ≤ (1+η)‖u_S‖₁ }.   (3.10)

We note that RE₂(Γ, S; η) is aimed at the ℓ2 coefficient estimation error, while RE₁(Γ, S; η) is aimed at the ℓ1 estimation error. When the designs are fully observed, RE₁ and RE₂ are guaranteed to be non-negative because Γ = Σ = XᵀX/n is positive semi-definite. When noise or missingness comes into the design, neither Γ = Γ̂_noi nor Γ̂_mis is positive semi-definite. Thus the RE₁ and RE₂ defined in (3.9) and (3.10) may be negative.
3.2.2 Main results
For a given penalty ρ(·; ·), the Karush-Kuhn-Tucker (KKT) type conditions for λ ⊕ β̂ ∈ R^{1+p} to be a critical point of program (3.2) are

  γ̂_j − (Γ̂β̂)_j = ρ̇(β̂_j; λ)    if β̂_j ≠ 0,
  |γ̂_j − (Γ̂β̂)_j| ≤ λ          if β̂_j = 0.   (3.11)

For a fully observed design and a convex penalty, the local KKT condition is a necessary and sufficient condition for a global minimizer. For the general cases where missing or noisy data exist, the solutions of (3.11) include all local minimizers of (3.2).
As in Chapter 2, we consider the class of all penalty functions ρ(·; λ) with penalty level no smaller than λ_* and concavity no greater than κ_*:

  P(λ_*, κ_*) = the set of all penalties satisfying (i) to (iv) in Chapter 2
                with λ ≥ λ_* and κ(ρ_λ) ≤ κ_*.

Then we define the local solution set that we consider here. Let

  B(λ_*, κ_*) = the set of all solutions of (2.12) for some ρ(·; λ) ∈ P(λ_*, κ_*)

be the class of all local solutions for penalties ρ(·; λ) ∈ P(λ_*, κ_*). The local solution set we consider here is the subclass of B(λ_*, κ_*) connected to the origin through a continuous path. Formally, denote

  B_0(λ_*, κ_*) = the set of all vectors connected to 0 in B(λ_*, κ_*).

The penalty level we consider here is no smaller than a certain λ_* satisfying

  ‖γ̂ − Γ̂β*‖_∞ < ηλ_*   (3.12)

for a certain constant 0 < η < 1. We will provide specifications of λ_* under the different disturbance scenarios, e.g. missing or noisy data, in Section 3.3.
Theorem 3.1. Let β̂ be a local minimizer in B_0(λ_*, κ_*) with a penalty ρ(·; λ) ∈ P(λ_*, κ_*). Suppose Γ̂ satisfies RE₂²(Γ̂, S; η) ≥ κ_*. Let ξ = (1+η)/(1−η). If (Γ̂, γ̂) satisfies (3.12), then

  ‖β̂ − β*‖_q ≤ 4ξλ|S| / {(1−η) RE₁²(S; η)}             for q = 1,
  ‖β̂ − β*‖_q ≤ 2ξλ|S|^{1/2} / {RE₁(S; η) RE₂(S; η)}     for q = 2.   (3.13)

In terms of the ℓ1 and ℓ2 coefficient estimation error bounds, (3.13) can be viewed as a generalization of (2.20), as noisy or missing data are allowed in the design matrix. We can see that (3.13) attains the same form of coefficient estimation error bounds as (2.20) with no additional assumption. Together with Theorem 2.1, this provides a unified treatment of penalized least-squares methods, including the ℓ1 and concave penalties for fully observed or missing/noisy data, under the RE condition on the design matrix and natural conditions on the penalty.
3.3 Theoretical penalty levels for missing/noisy data
As discussed in Chapter 2, the universal penalty level λ_* can be viewed as a probabilistic upper bound of ‖γ̂ − Γ̂β‖_∞. When the design is fully observed, a well-known upper bound for ‖γ̂ − Γ̂β‖_∞ = ‖Xᵀε/n‖_∞ is Aσ{(2/n) log p}^{1/2} with A a certain constant. In this section, we provide tight upper bounds for ‖γ̂ − Γ̂β‖_∞ under the missing and noisy data scenarios.
Theorem 3.2. Suppose that the rows of X are iid zero-mean sub-Gaussian random vectors with parameters (Σ_x, σ_x²), and let σ be the noise level. Suppose n ≳ log p.
(i) Additive noise: let Z = X + W be the observed design with noise matrix W. Suppose the rows of W are iid zero-mean sub-Gaussian random vectors with parameters (Σ_w, σ_w²), independent of X. Let σ_z² = σ_w² + σ_x² and

  (µ₁, µ₂) = ( A {(log p)/n}^{1/2} σ_z σ_w,  A {(log p)/n}^{1/2} σ_z ),

where A is a certain constant. Then

  ‖γ̂_noi − Γ̂_noi β*‖_∞ ≤ µ₁‖β*‖₂ + µ₂σ

with probability at least 1 − c₁ exp(−c₂ log p).
(ii) Missing data: let Z = X ∘ η be the missing-data design matrix with missing probability π. Let

  (µ₁, µ₂) = ( A {(log p)/n}^{1/2} ( σ_x²/(1−π) + σ_x²/(1−π)² ),  A {(log p)/n}^{1/2} σ_x/(1−π) ),

where A is a certain constant. Then

  ‖γ̂_mis − Γ̂_mis β*‖_∞ ≤ µ₁‖β*‖₂ + µ₂σ

with probability at least 1 − c₁ exp(−c₂ log p).
Remark 3.1. When the missing probabilities of each column of design matrix are different,
Theorem 3.2 can be extended by letting π “ πmax “ max1ďjďp πj, with πj being the missing
probability in column j.
Theorem 3.2 provides guidance on choosing the penalty level when noise or missing data appear. It proves that a linear combination of the ℓ2 norm of the coefficients and the noise level is large enough to bound ‖γ̂ − Γ̂β*‖_∞ under both scenarios. Compared with the fully observed design, an extra ℓ2-norm term in the coefficients is required in the penalty to compensate for the missingness or noise. Combining Theorem 3.2 and Theorem 3.1, we see that the coefficient estimation bounds are of the order

  ‖β̂ − β*‖₂ ≍ {(s log p)/n}^{1/2} (‖β*‖₂ + Cσ),
  ‖β̂ − β*‖₁ ≍ s {(log p)/n}^{1/2} (‖β*‖₂ + Cσ)   (3.14)

for a certain constant C.
The matrix-uncertainty (MU) selector in (3.8) [64, 65] is a Dantzig-selector type estimator for high-dimensional regression with noisy or missing data. In our notation, the feasible set of the MU selector is ‖γ̂ − Γ̂β‖_∞ ≤ µ‖β‖₁ + τ for certain constants µ and τ. Since ‖β‖₂ is no larger than ‖β‖₁, an MU selector with the modified feasible set ‖γ̂ − Γ̂β‖_∞ ≤ µ‖β‖₂ + τ may achieve sharper coefficient estimation error bounds.
Proof of Theorem 3.2 We first state the following lemmas, which will be used to prove
Theorem 3.2.
Lemma 3.1. Suppose X P Rnˆp1 and Y P Rnˆp2 are composed of independent rows of
covariates Xi,˚ and Y i1,˚ respectively, i “ 1, ..., p1, i1 “ 1, ..., p2. Assume Xi,˚ and Y i1,˚
are zero-mean sub-Gaussian vectors with parameters pΣx, σ2xq and pΣy, σ
2yq respectively. If
n Á log p, then
Pˆ
Y TX
n´ CovpY i,Xiq8 ě ε
˙
ď 6p1p2 exp
ˆ
´cnmin
"
ε2
pσxσyq2,
ε
σxσy
*˙
(3.15)
where c is certain constant.
Proof of Lemma 3.1. The proof of Lemma 3.1 can be seen in the supplementary material
of [44].
Now we can prove Theorem 3.2 (i). First note that the observed matrix Z “ X `W
has sub-Gaussian rows with parameters σ2x`σ
2w. This is because for any unit vector u P Rp,
E“
exppλuTZi,˚q‰
“ E“
exp`
λuT pXi,˚ `W i,˚q˘‰
ď exp`
p12qλ2pσ2x ` σ
2wq˘
. (3.16)
Moreover, Wβ˚ is also sub-Gaussian vector with parameter σ2wβ
˚22 since for any unit
vector v P Rn,
E“
exppλvTWβ˚q‰
“ E
«
exp
˜
nÿ
i“1
λvjβ˚2pβ
˚β˚2qTW i,˚
¸ff
ď
pź
j“1
exp`
λ2v2j β
˚22˘
“ exp`
λ2β˚22˘
. (3.17)
On the other hand,
pγnoi ´ pΓnoiβ˚8 “ ZTyn´ pZTZn´Σwqβ
˚8
ď ZT εn8 ` `
ZTW n´Σw
˘
β˚8 (3.18)
Thus, by the sub-Gaussianity Z, Wβ˚, ε and Lemma 3.1, we have
P
˜
ZTεn8 ě C
c
log p
nσzσ
¸
ď c1 expp´c2 log pq, (3.19)
and
P
˜
pZTW n´Σwqβ˚8 ě C
c
log p
nσzσwβ
˚2
¸
ď c1 expp´c2 log pq. (3.20)
Combing (3.18),(3.19) and (3.20), we have
P
˜
pγnoi ´ pΓnoiβ˚8 ě C
c
log p
npσzσ ` σzσwβ
˚2q
¸
ď c1 expp´c2 log pq. (3.21)
Similarly we can prove (ii). First note that Z “ Xη has sub-Gaussian rows with
parameter σx since for any unit vector,
E“
exppλuTZi,˚| missing values q‰
“ E“
exp`
λuTXi,˚
˘‰
ď exp`
p12qλ2σ2x
˘
.
Moreover, Xβ˚ and Zβ˚ are both sub-Gaussian vectors with parameter σ2xβ
˚2 by the
same argument of (3.17). On the other hand,
pγmis ´ pΓmisβ˚8
ď pγmis ´Σxβ˚8 ` ppΓmis ´Σxqβ
˚8
“1
1´ πZTyn´ CovpZi,Xiqβ
˚8 ` ppΓmis ´Σxqβ˚8
ď1
1´ πZTXβ˚n´ CovpZi,Xiβ
˚q8 `1
1´ πZTεn8
`ppΓmis ´Σxqβ˚8. (3.22)
By the sub-Gaussianity of Z, Xβ˚, ε and Lemma 3.1,
P
˜
1
nZTXβ˚ ´ CovpZi,˚,Xi,˚β
˚q8 ě C
c
log p
nσ2xβ
˚2
¸
ď c1 expp´c2 log pq, (3.23)
and
P
˜
ZTε
n8 ě C
c
log p
nσxσ
¸
ď c1 expp´c2 log pq. (3.24)
To control ppΓmis ´Σxqβ˚8, we define matrix M ,
Mij “ Epηiηjq “
$
’
’
&
’
’
%
p1´ πq2, i ‰ j,
1´ π, i “ j,
and covariance matrix Σz “ CovpZi,˚,Zi,˚q. Then
ppΓ´Σxqβ˚8 “
›
›
`
ZTZn´Σzq cM˘
β˚›
›
8ď
1
p1´ πq2ZTZβ˚n´Σzβ
˚8.
Thus
P
˜
ppΓ´Σxqβ˚8 ě C
c
log p
n
1
p1´ πq2σ2xβ
˚2
¸
ď c1 expp´c2 log pq. (3.25)
Combining (3.22), (3.23), (3.24) and (3.25), we conclude that
P
˜
pγmis ´ pΓmisβ˚8 ě C
c
log p
n
„
` σ2x
1´ π`
σ2x
p1´ πq2˘
β˚2 `σx
1´ πσ
¸
ď c1 expp´c2 log c2pq.
3.4 Scaled PLSE and Variance Estimation
In this section, we extend the scaled PLSE for fully observed data of Section 2.4 to the missing and noisy data scenarios.
We start by proposing the noise level estimator for missing or noisy data. Let us first review the noise level estimator for a fully observed design in the high-dimensional setting:

  σ²(λ) = ‖y − Xβ̂(λ)‖₂²/n = β̂ᵀ(λ)Σβ̂(λ) − 2zᵀβ̂(λ) + ‖y‖₂²/n   (3.26)

with a given penalty λ. When facing the noisy/missing data issue, (Σ, z) is not directly available. A natural estimator of the noise level is obtained by replacing (Σ, z) with (Γ̂, γ̂) in (3.26), yielding

  σ̂²(λ) = β̂ᵀ(λ)Γ̂β̂(λ) − 2γ̂ᵀβ̂(λ) + ‖y‖₂²/n.   (3.27)

On the other hand, if the true coefficient vector were given, an "oracle" could estimate the noise level with noisy/missing data by

  (σᵒ)² = (β*)ᵀΓ̂β* − 2γ̂ᵀβ* + ‖y‖₂²/n.   (3.28)

Here σᵒ can be viewed as a natural estimation target for σ. Indeed, σᵒ is not only the best noise level estimator one can obtain in the missing/noisy data scenarios; it is also close enough to the true σ. In fact, one may prove that σᵒ converges to σ* = ‖y − Xβ*‖₂/√n under mild conditions. Since σ* is the maximum likelihood estimator of σ when β* is known, and n(σ*/σ)² follows the χ²_n distribution under the Gaussian assumption, this guarantees that σᵒ goes to σ under certain conditions.
Given the noise level estimator (3.27), the scaled PLSE is defined as

  (β̂^{scal}, σ̂) = (β̂(λ̂), σ̂(λ̂)),   λ̂ = max{ λ : λ ≤ µ₁‖β̂(λ)‖₂ + µ₂σ̂(λ) },   (3.29)

where µ₁ and µ₂ are pre-specified coefficients that change depending on the nature of the model (noise or missing). For example, we may let
  (µ₁, µ₂) = ( (1/η) {(log p)/n}^{1/2} σ_z σ_w,  (1/η) {(log p)/n}^{1/2} σ_z )   (3.30)

for noisy data, and

  (µ₁, µ₂) = ( (1/η) {(log p)/n}^{1/2} ( σ_x²/(1−π) + σ_x²/(1−π)² ),  (1/η) {(log p)/n}^{1/2} σ_x/(1−π) )   (3.31)

for missing data with missing probability π, where η ∈ (0, 1] is some constant. In light of
the oracle noise level estimator σᵒ, we define an oracle penalty level as

  λᵒ = µ₁‖β*‖₂ + µ₂σᵒ.   (3.32)

We will derive upper and lower bounds for λ̂/λᵒ − 1 in the following analysis. Before that, some more definitions are needed. Let

  τ₁ = 2ξ|S|^{1/2} / {RE₁(S; η) RE₂(S; η)}^{1/2},   τ₂ = 2|S|^{1/2} {(1+η)(1+3η)}^{1/2} / {(1−η) RE₁(S; η)}.   (3.33)
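A minimal R sketch of the rule (3.29) follows. It scans a decreasing grid of penalty levels and returns the first (largest) λ satisfying λ ≤ µ₁‖β̂(λ)‖₂ + µ₂σ̂(λ); the solver `fit_plse` for the surrogate program (3.2) at a given λ is a hypothetical placeholder, not the API of any existing package.

```r
# Scaled PLSE for noisy/missing designs: the selection rule in (3.29),
# with the noise level estimator (3.27) evaluated along the path.
scaled_plse <- function(Gamma, gamma, y, lambdas, mu1, mu2, fit_plse) {
  n <- length(y)
  for (lam in sort(lambdas, decreasing = TRUE)) {
    b  <- fit_plse(Gamma, gamma, lam)                       # beta_hat(lambda)
    s2 <- drop(t(b) %*% Gamma %*% b) - 2 * sum(gamma * b) + sum(y^2) / n  # (3.27)
    s  <- sqrt(max(s2, 0))                                  # Gamma may be indefinite
    if (lam <= mu1 * sqrt(sum(b^2)) + mu2 * s)              # rule in (3.29)
      return(list(beta = b, sigma = s, lambda = lam))
  }
  NULL
}
```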
Theorem 3.3. Let (β̂^{scal}, σ̂) be the scaled penalized regression estimator in (3.29), with (µ₁, µ₂) as in (3.30) for the noisy design and as in (3.31) for the missing design. Let ξ = (1+η)/(1−η), λᵒ be as in (3.32), τ₁, τ₂ be as in (3.33), and τ₀ = µ₁τ₁ + µ₂τ₂. Suppose RE₂²(S; η) ≥ κ_* holds along the solution path with λ ≥ λ_*. When ‖γ̂ − Γ̂β*‖_∞ < (1−τ₀)ηλᵒ, we have

  max( 1 − λ̂/λᵒ, 1 − λᵒ/λ̂ ) ≤ τ₀.   (3.34)

Moreover,

  |σ̂ − σᵒ| ≤ λ̂τ₂ ≤ λᵒτ₂/(1−τ₀),   ‖β̂^{scal} − β*‖₂ ≤ λ̂τ₁ ≤ λᵒτ₁/(1−τ₀).   (3.35)

Remark 3.2. If one chooses (µ₁, µ₂) as in (3.30) for the noisy design and as in (3.31) for the missing design, then λ̂ → λᵒ when (s/n) log p → 0.
Theorem 3.3 guarantees the consistency of penalty level estimation via an oracle
inequality for the prediction error of the concave PLSE for missing/noisy design.
Proof of Theorem 3.3. We first consider penalty level λ0 “ λop1´τ0q. Since pγ´pΓβ8 ă
λ0η, it follows from Theorem 3.1 that
pβpλ0q ´ β˚2 ď λ0τ1.
where τ1 “ 2ξ|S|12 tRE1pS; ηqRE2pS; ηqu12. Thus,
pβpλ0q2 ě β˚2 ´ pβpλ0q ´ β
˚2 ě β˚2 ´ λ0τ1. (3.36)
Now denote Λpβq “ p12qβT pΓβ ´ βTpγ. By Taylor’s expansion,
Λ`
pβpλ0q˘
“ Λpβ˚q `∇Λpβ˚qT`
pβpλ0q ´ β˚˘
` p12q`
pβpλ0q ´ β˚˘T
pΓ`
pβpλ0q ´ β˚˘
ě Λpβ˚q `∇Λpβ˚qT`
pβpλ0q ´ β˚˘
` p12qκ˚pβpλ0q ´ β˚22
ě Λpβ˚q `∇Λpβ˚qT`
pβpλ0q ´ β˚˘
. (3.37)
where the first inequality holds because λ0 ě λ˚ and thus RE22pS; ηq ě κ˚. It then follows
that
Λpβ˚q ´ Λ`
pβpλ0q˘
ď ´∇Λpβ˚qT`
pβpλ0q ´ β˚˘
ď pγ ´ pΓβ˚8pβpλ0q ´ β˚1
ď ηλ0pβpλ0q ´ β˚1
ď4λ2
0|S|
pη ` η2qp1´ ηq2(
RE21pS; ηq
( ă λ20τ
22 2, (3.38)
The forth inequality holds by Theorem 3.1. Then we have
pσpλ0q “ σ˚ ´´
b
2Λ`
β˚˘
` y22n´
b
2Λ`
pβpλ0q˘
` y22n¯
ě σ˚ ´ λ0τ2. (3.39)
Combining (3.36) and (3.39), we have that
µ1pβpλ0q2 ` µ2pσpλ0q ě µ1
`
β˚2 ´ λ0τ1
˘
` µ2
`
σ˚ ´ λ0τ2
˘
“ µ1β˚2 ` µ2σ
˚ ´ λ0
`
µ1τ1 ` µ2τ2
˘
“ λo`
1´ p1´ τ0qτ0
˘
ě λo`
1´ τ0
˘
“ λ0.
Since pλ “ maxtλ : λ ď µ1pβpλq2 ` µ2pσpλqu, we have
pλ ě λ0 “ λop1´ τ0q ě λ˚. (3.40)
Now consider penalty level pλ. Since pλ ě λ˚, it follows from Theorem 3.1 that pβscal
´
β˚2 ď pλτ1. This proves the second part of (3.35). Furthermore,
pβscal2 ď β
˚2 ` pβscal
´ β˚2 ď β˚2 ` pλτ1. (3.41)
Similarly,
Λ`
pβscal˘
“ Λpβ˚q `∇Λpβ˚qT`
pβscal
´ β˚˘
` p12q`
pβscal
´ β˚˘T
pΓ`
pβscal
´ β˚˘
ď Λ`
β˚˘
` pγ ´ pΓβ˚8pβscal
´ β˚1 ` p12q`
pβ ´ β˚˘T
pΓ`
pβ ´ β˚˘
ď Λ`
β˚˘
` ηpλpβscal
´ β˚1 ` p12q`
pβscal
´ β˚˘T
pΓ`
pβscal
´ β˚˘
(3.42)
Since pλ ě λ˚, by Theorem 3.1,
pβscal
´ β˚1 ď4ξpλ|S|p1´ ηq
RE21pS; ηq
, ppβscal
´ β˚qT pΓppβscal
´ β˚q ď
2ξpλ(2|S|
RE21pS; ηq
.
Put above into (3.42), we obtain
Λ`
pβscal˘
´ Λpβ˚q ď2pλ2|S|
p1` ηqp1` 3ηqp1´ ηq2(
RE21pS; ηq
“ pλ2τ22 2.
It then follows that
pσ “ σ˚ `´
b
2Λ`
pβscal˘
` y22n´b
2Λ`
β˚˘
` y22n¯
ď σ˚ ` pλτ2 (3.43)
Combining (3.41) and (3.43), we have
pλ ď µ1pβscal2 ` µ2pσ ď µ1pβ
˚2 ` pλτ1q ` µ2pσ˚ ` pλτ2q
“ λo ` pλ`
µ1τ1 ` µ2τ2
˘
“ λo ` pλτ0,
This implies pλ ď λop1´ τ0q. Combining with (3.40), we proves (3.34). On the other hand,
Λ`
pβscal˘
“ Λpβ˚q `∇Λpβ˚qT`
pβscal
´ β˚˘
` p12q`
pβscal
´ β˚˘T
pΓ`
pβscal
´ β˚˘
ě Λ`
β˚˘
`∇Λpβ˚qT`
pβscal
´ β˚˘
` pκ˚2qpβscal
´ β˚2
ě Λ`
β˚˘
´ pγ ´ pΓβ˚8pβscal
´ β˚1 (3.44)
It then follows that
Λpβ˚q ´ Λ`
pβscal˘
ď pγ ´ pΓβ˚8pβscal
´ β˚1 ă pλ2τ22 2,
Then we have
pσ “ σ˚ ´´
b
2Λ`
β˚˘
` y22n´
b
2Λ`
pβscal˘
` y22n¯
ě σ˚ ´ pλτ2. (3.45)
Then the first part of (3.35) follows from (3.43) and (3.45). ˝
3.5 Conclusions
In this chapter, we extended the PLSE to noisy and missing designs and proved a rate-optimal coefficient estimation error bound while requiring no additional conditions. Moreover, we showed that a linear combination of the ℓ2 norm of the coefficients and the noise level is large enough as a penalty level when noise or missingness exists. This sharpens the commonly understood results, in which the ℓ1 norm of the coefficients is required. We further extended the scaled version of the PLSE to the missing and noisy data case. Since cross-validation based techniques are extremely time consuming and may be misleading for missing or noisy data, the proposed scaled solution is of great use.
Chapter 4
Group Lasso under Low-Moment Conditions on Random
Designs
4.1 Introduction
As discussed in previous chapters, the restricted eigenvalue (RE; 6) condition is among
the mildest imposed on the design matrix to guarantee desired statistical properties of
regularized estimators in high-dimensional regression. When the effects of design variables
are naturally grouped, the group Lasso has been shown to provide sharper results compared
with the Lasso [33], and such benefit of group sparsity has been proven under groupwise
RE conditions [54, 47, 52]. However, the RE condition is still somewhat abstract compared
with well understood properties of the design matrix such as sparse eigenvalue and
moment conditions. In this chapter, we prove that the groupwise RE and closely related
compatibility conditions can be guaranteed by a low moment condition for random designs
when the RE of the population Gram matrix is bounded away from zero. Our results include
the ordinary RE condition for the Lasso as a special case.
Consider the linear model

  y = Xβ* + ε,   (4.1)

where X = (x₁, ..., x_p) ∈ R^{n×p} is the design matrix, y ∈ R^n is the response vector, ε is a noise vector with mean E ε = 0 and covariance σ²I_{n×n}, and β* ∈ R^p is the target coefficient vector. The group Lasso [83] can be defined as

  β̂^{(G)} = β̂^{(G)}(λ) = argmin_β { ‖y − Xβ‖₂²/(2n) + Σ_{j=1}^J λ_j ‖β_{G_j}‖₂ },   (4.2)

where {G_j, 1 ≤ j ≤ J} forms a partition of the index set {1, ..., p} and λ = (λ₁, ..., λ_J) ∈ R^J gives the penalty levels. The Lasso [71] can be viewed as a special case of the group Lasso with all group sizes equal to one.
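For concreteness, here is a minimal proximal-gradient sketch of (4.2), assuming a common penalty level λ_j = λ and a fixed step size; it only illustrates the blockwise soft-thresholding structure of the group Lasso, and in practice a dedicated solver would be used.

```r
# Proximal-gradient sketch of the group Lasso (4.2) with lambda_j = lambda.
# `groups` is a list of index vectors forming a partition of 1:p.
group_lasso <- function(X, y, groups, lambda, iters = 500) {
  n <- nrow(X); p <- ncol(X)
  beta <- numeric(p)
  step <- n / (2 * norm(X, "2")^2)                   # conservative 1/(2L) step size
  for (t in seq_len(iters)) {
    grad <- -drop(crossprod(X, y - X %*% beta)) / n  # gradient of ||y - Xb||^2/(2n)
    z <- beta - step * grad
    for (g in groups) {                              # blockwise soft-thresholding
      nz <- sqrt(sum(z[g]^2))
      beta[g] <- if (nz > 0) max(0, 1 - step * lambda / nz) * z[g] else 0
    }
  }
  beta
}
```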
To guarantee the desired prediction and estimation performance of the group Lasso, restricted eigenvalue type conditions play a critical role. For any subset S ⊆ {1, ..., J} and positive number ξ, the groupwise RE for the ℓ2 estimation error can be defined as

  RE²_{*(G)}(Σ; S, ξ, λ) = inf{ uᵀΣu / ‖u‖₂² : u ∈ C_(G)(ξ, S, λ) },   (4.3)

where Σ ∈ R^{p×p}, λ specifies the penalty level in (4.2), and C_(G)(ξ, S, λ) is a cone defined by

  C_(G)(ξ, S, λ) = { u : Σ_{j∈Sᶜ} λ_j‖u_{G_j}‖₂ ≤ ξ Σ_{j∈S} λ_j‖u_{G_j}‖₂ }.   (4.4)

For the analysis of the prediction and weighted ℓ_{2,1} estimation error Σ_{j=1}^J λ_j‖β̂^{(G)}_{G_j} − β_{G_j}‖₂, the groupwise compatibility coefficient (CC) can be defined as

  CC²_(G)(Σ; S, ξ, λ) = inf{ uᵀΣu Σ_{j∈S} λ_j² / ( Σ_{j∈S} λ_j‖u_{G_j}‖₂ )² : u ∈ C_(G)(ξ, S, λ) }.   (4.5)

When |G_j| = 1 and λ_j does not depend on j, the groupwise RE and CC reduce to their original versions in [6] and [74] respectively. In what follows, the groupwise RE and CC conditions respectively refer to the case where the groupwise RE and CC are bounded away from zero.
While somewhat different versions of the groupwise RE and CC have been considered in [54], [75] and [47], we focus on verification of the groupwise RE and CC conditions with the quantities defined in (4.3) and (4.5), as the theory associated with these quantities better handles unequal group sizes and penalty levels. Specifically, the RE and CC in (4.3) and (4.5) have been used in [52] to prove the following oracle inequalities.
Let β be a vector with supp(β) ⊆ G_S = ∪_{j∈S} G_j and β̂^{(G)} the group Lasso estimator in (4.2). Let Σ = XᵀX/n be the sample Gram matrix and ξ > 1. Then, in the event max_{1≤j≤J} ‖X_{G_j}ᵀ(y − Xβ)‖₂/n ≤ λ_j(ξ−1)/(ξ+1), the prediction and estimation loss of the group Lasso are bounded by

  ‖Xβ̂^{(G)} − Xβ‖₂²/n ≤ C₁ Σ_{j∈S} λ_j² / CC²_(G)(Σ; S, ξ, λ),   (4.6)

  Σ_{j=1}^J λ_j ‖β̂^{(G)}_{G_j} − β_{G_j}‖₂ ≤ C₂ Σ_{j∈S} λ_j² / CC²_(G)(Σ; S, ξ, λ),   (4.7)

and

  ‖β̂^{(G)} − β‖₂ ≤ C₃ ( Σ_{j∈S} λ_j² )^{1/2} / RE²_{*(G)}(Σ; S, ξ, λ),   (4.8)

where C₁, C₂, C₃ are constants depending on ξ only.
Substantial effort has been made to deduce the RE-type conditions from commonly
understood conditions such as moment and eigenvalue conditions. [6], [77], [86] and [82]
used lower and upper sparse eigenvalues to bound the RE and CC. Raskutti et al. [60]
proved RE condition for Gaussian designs under a population RE condition and a sample
size condition of the form n ě Cs log p. Rudelson and Zhou [66] further reduced the
sample size requirement from n ě Cs log p to n ě Cs logppsq and extended the results
to sub-Gaussian designs. To establish the RE condition for sub-Gaussian designs, they
proved a reduction/transfer principle showing that the RE condition can be guaranteed by
examining the restricted isometry on a certain family of low-dimensional subspaces. More
importantly, these results contribute significantly to the literature by removing the upper
eigenvalue requirement imposed in earlier analyses of regularized least squares such as the
restricted isometry property (RIP; 12, 11) for the Dantzig selector and the sparse Riesz
condition (SRC; 85, 84) for the Lasso. Lecue and Mendelson [42] further weakened the sub-
Gaussian condition of Rudelson and Zhou [66] on the design to an m-th moment condition
of order m ě C log p and a small-ball condition, while van de Geer and Muro [76] imposed
an m-th order isotropy condition with m ą 2 and tail probability conditions on the sample
second moment of the design variables with nonzero coefficients.
Compared with the rich variety of existing results on the RE-type conditions for the
Lasso, the literature on the validity of the groupwise RE-type conditions is rather thin.
Mitra and Zhang [52] proved that the groupwise RE and CC conditions hold for sub-
Gaussian matrices. However, their results require both the upper and lower eigenvalue
conditions on the population Gram matrix.
In this chapter, we show that the groupwise RE condition can be guaranteed by a low moment condition on the design matrices when the population RE is bounded away from zero. Specifically, we prove that the groupwise RE condition holds under (i) a second-moment uniform integrability assumption on the linear combinations of the design variables and (ii) a fourth-moment uniform boundedness assumption on the individual design variables together with an m-th moment assumption on the linear combinations of the within-group variables for m > 2, given a corresponding population RE condition and the usual sample size requirement. Moreover, the fourth- and m-th moment assumptions can be removed given a slightly larger sample size. Besides, the groupwise CC condition can also be guaranteed under the same type of low moment conditions. All the results include the RE-type conditions for the Lasso as a special case. Our results indicate that accurate statistical estimation and prediction is feasible in high-dimensional regression with grouped variables for a broad class of design matrices. Furthermore, they also provide a theoretical foundation for bootstrapped penalized least-square estimation.
The rest of this chapter is organized as follows. In Section 4.2, we review existing
restricted-eigenvalue type conditions. In Section 4.3, we prove a group transfer principle
which is the key to proving the RE and CC conditions. In Section 4.4 and 4.5 we study
the groupwise CC and RE conditions respectively. In Section 4.6 we study the convergence
of the groupwise restricted eigenvalue and compatibility coefficient. Section 4.7 provides
some additional lemmas that used to prove the CC and RE conditions. Section 4.8 contains
discussion.
Notation: Throughout this chapter, we let X be the normalized design matrix, i.e., ‖x_j‖₂² = n, and β* be the corresponding vector of true regression coefficients. For a vector v = (v₁, ..., v_p), ‖v‖_q = (Σ_j |v_j|^q)^{1/q} denotes the ℓ_q norm, |v|_∞ = max_j |v_j|, and ‖v‖₀ = #{j : v_j ≠ 0}. For a matrix M, ‖M‖₂ = sup_{‖u‖₂=1} ‖Mu‖₂ is the operator norm, and φ_min(M) and φ_max(M) are the minimum and maximum eigenvalues of M respectively. For a number x, ⌈x⌉ denotes the smallest integer larger than x. Moreover, we let S ⊆ {1, ..., p} for the Lasso and S ⊆ {1, ..., J} for the group Lasso.
4.2 A review of restricted eigenvalue type conditions
In this section, we briefly review the existing conditions on the designs required by the
Lasso and group Lasso to achieve the oracle properties. We let the set S P p1, ..., pq for the
Lasso and S P p1, ..., Jq for the group Lasso in this chapter.
Before the introduction of the RE condition, the restricted isometry property (RIP;
12, 11) and sparse Riesz condition (SRC; 85, 84) were imposed to analyze the Dantzig
selector and Lasso respectively. Candes and Tao [11] further improved the RIP condition
and named it the uniform uncertainty principle (UUP). The RIP and UUP conditions are
specialized for random designs with covariance matrix close to Ipˆp, while the SRC condition
works for more general random designs.
Bickel et al. [6] proposed the RE condition and under which provided oracle inequalities
for prediction and coefficients estimation for the Lasso. For a positive number ξ, the ordinary
RE coefficient REpΣ;S, ξq for the prediction and `1 coefficients estimation takes the form
RE2pΣ;S, ξq “ inf
"
uTΣu
uS22: uSc1 ď ξuS1
*
. (4.9)
The RE coefficient for the `2 estimation error takes the form
RE2˚pΣ;S, ξq “ inf
"
uTΣu
u22: uSc1 ď ξuS1
*
. (4.10)
The ordinary RE (4.10) is a special case of (4.3) when group size dj “ 1. Moreover, the
groupwise version of RE (4.9) takes the form [54, 47]
RE2pGqpΣ;S, ξ,λq “ inf
"
uTΣu
uS22: u P CpGqpξ, S,λq
*
.
We note that CC2pGqpΣ;S, ξ,λq ě RE2
pGqpΣ;S, ξ,λq ě RE2˚pGqpΣ;S, ξ,λq. Same as the
groupwise CC, RE2pGqpΣ;S, ξ,λq is also aimed at the prediction and the mixed `2,1 estimation
errors. When dj “ 1, the CC, originally formulated by van de Geer [74], becomes
CC2pΣ;S, ξq “ inf
"
uTΣu|S|
uS21: uSc1 ď ξuS1
*
. (4.11)
van de Geer and Buhlmann [77] proved that the prediction and `1 estimation loss of the
Lasso are under control when the CC is bounded away from zero.
The restricted strong convexity (RSC) condition introduced by Negahban et al. [55],
could also be viewed as an RE-type condition with a slightly larger cone. The RSC for
prediction and `1 estimation can be written as
κ2pΣ;S, ξq “ inf
"
uTΣu
uS22: uSc1 ď ξ|S|12uS2
*
. (4.12)
The original RSC condition takes the form
uTΣu ě
$
’
’
&
’
’
%
α1u22 ´ τ1tplog pqnuu21, u2 ď 1
α2u2 ´ τ2
a
plog pqnu1, u2 ě 1,
(4.13)
for positive constants α1, α2 and nonnegative constants τ1, τ2. We prove in Section 4.8 that
the RSC condition (4.13) is equivalent to that κpΣ;A, ξq is bounded below by a positive
constant for any set A with cardinality |A| ď Cn log p. Moreover, the RSC coefficient for
`2 estimation error can be defined as
κ2˚pΣ;S, ξq “ inf
"
uTΣu
u22: uSc1 ď ξ|S|12uS2
*
. (4.14)
We further define the groupwise RSC coefficient for the prediction and `2,1 estimation as
κ2pGqpΣ;S, ξ,λq “ inf
"
uTΣu
uGS22
: u P C ˚pGqpξ, S,λq
*
, (4.15)
and the groupwise RSC coefficient for the `2 estimation as
κ2˚pGqpΣ;S, ξ,λq “ inf
"
uTΣu
u22: u P C ˚pGqpξ, S,λq
*
, (4.16)
where the cone C ˚pGqpξ, S,λq takes the form
C ˚pGqpξ, S,λq “
u :ÿ
jPSc
λjuGj2 ď ξ´
ÿ
jPS
λ2j
¯12uGS
2(
. (4.17)
4.3 The group transfer principle
The main purpose of this chapter is to show that the groupwise RE-type conditions can be guaranteed by a low moment condition. The key to proving this is the group transfer principle. Specifically, to control the RE-type coefficients, it is essential to minimize uᵀΣu over a certain cone. The "transfer principle" refers to the property that the cone in the minimization problem can be transferred to a smaller cone of proper cardinality. Oliveira [57] proved the transfer principle and used it to prove the RE-type conditions. In this section, we provide a groupwise version of the transfer principle.
Before introducing the group transfer principle, we first state the strong group sparsity condition of Huang and Zhang [33]. A coefficient vector β ∈ R^p is strongly group-sparse if there exist integers g and s such that

  supp(β) ⊆ G_S = ∪_{j∈S} G_j,   |S| ≤ g,   |G_S| ≤ s.   (4.18)

Further, we define

  d_j = |G_j|,  j = 1, ..., J,   and   max_{j∈Sᶜ} d_j = d*_{Sᶜ},   max_{j∈S} d_j = d*_S.   (4.19)

We let C and c denote generic positive constants in all our theoretical results. Their values may vary in different expressions, but they remain universal constants.
Theorem 4.1. Let d˚Sc and d˚S be as in (4.19), λ˚ “ maxjPSc λj, ξ ą 0, σ be regression
noise level. Suppose that λj “ pa
dj `?
2 log JqA0σ?n with certain A0 ą 1, j “ 1, ..., J
and (4.18) holds for certain integers g, s. For any L ą 0, define k˚ “ CLξ2`
d˚Sc`s`g log J˘
and s˚ “ max!
ř
jPA dj : A P Sc,ř
jPApd12j `
?2 log Jq2 ď k˚
)
. Then,
(i) For any L ą 0 and u P Rp, there exist v P Rp such that
v0 ď s˚, vGS“ uGS
,ÿ
jPSc
λjvGj2 “ÿ
jPSc
λjuGj2, (4.20)
and
uTΣu ě vTΣv ´min! 1
L,
`ř
jPS λ2j
˘12
Lλ˚
)
. (4.21)
(ii) Let ε ą 0 and D be a block diagonal matrix with block corresponding to G. Ifř
jPS λ2jφmaxpDGj ,Gj q ď p1 ` εq
ř
jPS λ2j and φminpDGj ,Gj q ě 1p1 ` εq,@j P Sc, then for
any ε1 ą 0, L ą 0 and u P Rp, there exist v P Rp such that (4.20) holds and
uTΣu
D´12u22ě p1´ ε1q
vTΣv
D´12v22´ C min
! 1
L,
`ř
jPS λ2j
˘12
Lλ˚
)
, (4.22)
where C is a constant that depending on ε and ε1 only.
Remark 4.1. A more explicit form of the k˚ in the definition of cardinality s˚ is
k˚ “ 4Lξ2 max
U, V(
` p43qU,
where U “`
td˚Scu12 `
?2 log J
˘2, V “
ř
jPS
`
d12j `
?2 log J
˘2. We do not seek the best
(smallest) cardinality here. In fact, smaller s˚ can be found easily.
We note that (4.21) aims at the compatibility condition, while (4.22) aims at the restricted eigenvalue condition. Indeed, to prove the CC condition, one only needs to control the nonzero part of u, namely u_{G_S}. However, the whole vector u needs to be controlled in order to prove the RE condition.
In the special case of Theorem 4.1 where d_j = 1, j = 1, ..., J, the transfer principle (4.21) and (4.22) can be written as follows: ‖v‖₀ ≤ s* = ⌈Lξ²s⌉, v_{G_S} = u_{G_S}, ‖v_{Sᶜ}‖₁ = ‖u_{Sᶜ}‖₁, and

  uᵀΣu ≥ vᵀΣv − ξ²s/s*.   (4.23)

Note that (4.23) is close to the transfer principle proved by Oliveira [57], and it is also the key for van de Geer and Muro [76] to prove the ordinary CC condition under low moment conditions.
Proof of Theorem 4.1. We apply a stratified version of Maurey’s empirical method
to groups of different sizes. Let λ˚ “ minjPSc λj , λ1˚ “ minjPS λj , J0 “ S,
Jk “
j P SczJk´1 : λj ď 2kλ˚(
, k “ 1, . . . , k˚,
with k˚ “ rlog2pλ˚λ˚qs ` 1. Let u P CpGqpξ, S,λq. Define U j P Rp by tU juGj “ uGj and
tU juGi “ 0¯, i ‰ j. Let Zpi,kq be independent vectors independent of X with
P!
Zpi,kq “ π´1j,kU j
)
“ πj,k “λjuGj2Itj P Jkuř
jPJkλjuGj2
. (4.24)
Let
Zpkq“
k´1ÿ
`“0
ÿ
jPJ`
U j `
k˚ÿ
`“k
m´1`
mÿ
i“1
Zpi,`q, k “ 1, ..., k˚. (4.25)
Also let Zpk˚`1q
“ u. As E“
Zpkqˇ
ˇZpk`1q
,Σ‰
“ Zpk`1q
,
E“`
Zpkq˘T
ΣZpkqˇ
ˇZpk`1q
,Σ‰
ď`
Zpk`1q˘T
ΣZpk`1q
`m´1k
ÿ
jPJk
π´1j,ku
TGj
ΣGj ,GjuGj
“`
Zpk`1q˘T
ΣZpk`1q
`m´1k
´
ÿ
jPJk
λjuGj2
¯´
ÿ
jPJk
uGj2λj
¯
ď`
Zpk`1q˘T
ΣZpk`1q
`
´
p2k´1λ˚q2mk
¯´1´ ÿ
jPJk
λjuGj2
¯2.
Let
mk “
RLξř
jPJkλjuGj2
´
ř
jPS λ2j
¯12max
´
λ˚,`ř
jPS λ2j
˘12¯
p2k´1λ˚q2ř
jPS λjuGj2
V
. (4.26)
It follows that
E“`
Zp1q˘T
ΣZp1qˇˇΣ
‰
ď uTΣu`k˚ÿ
k“1
´
p2k´1λ˚q2mk
¯´1´ ÿ
jPJk
λjuGj2
¯2
ď uTΣu`min! 1
L,
`ř
jPS λ2j
˘12
Lλ˚
)
ˆ
´
ř
jPS λjuGj2
¯2
ř
jPS λ2j
. (4.27)
Moreover, as
k˚ÿ
k“1
p2k´1λ˚q2mk ď Lξ2
`
ÿ
jPS
λ2j
˘12max
!
λ˚,`
ÿ
jPS
λ2j
˘12)
` p43qpλ˚q2,
for λj “ pa
dj `?
2 log JqA0σ?n, we have
ÿ
jPSc,pZp1qqGj
‰0
´
d12j `
a
2 log J¯2
ď 4Lξ2 max
U, pUV q12(
` p43qU
ď 4Lξ2 max
U, V(
` p43qU,
where U “`
td˚Scu12 `
?2 log J
˘2, V “
ř
jPS
`
d12j `
?2 log J
˘2. Therefore, there exists v
satisfying vGSc 0 ď s˚ with
s˚ “ max!
ÿ
jPA
dj : A P Sc,ÿ
jPA
´
d12j `
a
2 log J¯2ď cLξ2
d˚Sc ` s` g log J(
)
for certain constant c. Moreover, whenř
jPS λjuGj2 “
´
ř
jPS λ2j
¯12,
vGS“ uGS
,ÿ
jPSc
λjvGj2 “ÿ
jPSc
λjuGj2,
vTΣv ď uTΣu`min! 1
L,
`ř
jPS λ2j
˘12
Lλ˚
)
.
This proves (4.20) and (4.21).
To prove (4.22), we first note that
$$\begin{aligned}
(Z^{(1)})^\top \Sigma Z^{(1)}
&\le u_{G_S}^\top \Sigma_{G_S,G_S} u_{G_S} + \sum_{k=1}^{k^*} \max_i \|Z^{(i,k)}\|_2^2
\le \sum_{j\in S} \|u_{G_j}\|_2^2 + \sum_{k=1}^{k^*} \Big( \frac{\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2}{2^{k-1}\lambda_*} \Big)^2 \\
&\le \Big( \sum_{j\in S}\lambda_j\|u_{G_j}\|_2 \Big)^2 \Big(\frac{1}{\lambda'_*}\Big)^2 + \Big( \sum_{j\in S^c}\lambda_j\|u_{G_j}\|_2 \Big)^2 \sum_{k=1}^{k^*}\Big(\frac{1}{2^{k-1}\lambda_*}\Big)^2
\le \Big( \frac{1}{(\lambda'_*)^2} + \frac{4\xi^2}{3\lambda_*^2} \Big) \sum_{j\in S}\lambda_j^2. \qquad (4.28)
\end{aligned}$$
Moreover, by the property of $D$,
$$\Big( \sum_{j\in S}\lambda_j^2 \Big)^{1/2} = \sum_{j\in S}\lambda_j\big\|D^{1/2}_{G_j,G_j}D^{-1/2}_{G_j,G_j}u_{G_j}\big\|_2
\le \sum_{j\in S}\lambda_j\phi^{1/2}_{\max}(D_{G_j,G_j})\big\|D^{-1/2}_{G_j,G_j}u_{G_j}\big\|_2
\le (1+\varepsilon)^{1/2}\Big( \sum_{j\in S}\lambda_j^2 \Big)^{1/2}\big\|D^{-1/2}_{G_S,G_S}u_{G_S}\big\|_2.$$
It follows that
$$\big\|D^{-1/2}Z^{(1)}\big\|_2 \ge \big\|D^{-1/2}_{G_S,G_S}Z^{(1)}_{G_S}\big\|_2 = \big\|D^{-1/2}_{G_S,G_S}u_{G_S}\big\|_2 \ge \frac{1}{(1+\varepsilon)^{1/2}}. \qquad (4.29)$$
By Lemma 4.2, for any $0<\varepsilon_1<1$,
$$\mathbb{P}\Big\{ \big\|D^{-1/2}Z^{(1)}\big\|_2^2 \le (1-\varepsilon_1)\big\|D^{-1/2}u\big\|_2^2 \Big\} \le k^*\exp\{-c\varepsilon_1^2 L\},$$
where $c$ is a constant depending on $\varepsilon$ and $\varepsilon_1$ only. Combining this with (4.28) and (4.29), we have
$$\mathbb{E}\Bigg[ \frac{(Z^{(1)})^\top\Sigma Z^{(1)}}{\|D^{-1/2}Z^{(1)}\|_2^2} \,\Bigg|\, X \Bigg]
\le \mathbb{E}\Bigg[ \frac{(Z^{(1)})^\top\Sigma Z^{(1)}}{(1-\varepsilon_1)\|D^{-1/2}u\|_2^2} \,\Bigg|\, X \Bigg]
+ k^*\exp\big(-c\varepsilon_1^2 L\big)(1+\varepsilon)\Big( \frac{1}{(\lambda'_*)^2} + \frac{4\xi^2}{3\lambda_*^2} \Big)\sum_{j\in S}\lambda_j^2.$$
Combining this with (4.27), we have that
$$\mathbb{E}\Bigg[ \frac{(Z^{(1)})^\top\Sigma Z^{(1)}}{\|D^{-1/2}Z^{(1)}\|_2^2} \,\Bigg|\, X \Bigg]
\le \frac{u^\top\Sigma u}{(1-\varepsilon_1)\|D^{-1/2}u\|_2^2}
+ \frac{1+\varepsilon}{1-\varepsilon_1}\min\Big\{ \frac{1}{L},\ \frac{(\sum_{j\in S}\lambda_j^2)^{1/2}}{L\lambda^*} \Big\}
+ k^*\exp\big(-c\varepsilon_1^2 L\big)(1+\varepsilon)\Big( \frac{1}{(\lambda'_*)^2} + \frac{4\xi^2}{3\lambda_*^2} \Big)\sum_{j\in S}\lambda_j^2. \qquad (4.30)$$
Since the third term on the right-hand side of (4.30) is of no larger order than the second term, we have that
$$\mathbb{E}\Bigg[ \frac{(Z^{(1)})^\top\Sigma Z^{(1)}}{\|D^{-1/2}Z^{(1)}\|_2^2} \,\Bigg|\, X \Bigg]
\le \frac{u^\top\Sigma u}{(1-\varepsilon_1)\|D^{-1/2}u\|_2^2} + C\min\Big\{ \frac{1}{L},\ \frac{(\sum_{j\in S}\lambda_j^2)^{1/2}}{L\lambda^*} \Big\} \qquad (4.31)$$
holds for some constant $C$. Then (4.22) follows from (4.31). $\square$
4.4 Groupwise compatibility condition
In this section, we prove that the groupwise compatibility condition, which is sufficient to control the prediction and mixed $\ell_{2,1}$ estimation errors, can be guaranteed under low moment conditions on random designs using the group transfer principle.
Following Yuan and Lin [83], suppose that the design matrix $X$ in the linear model (4.1) is normalized such that $X_{G_j}^\top X_{G_j}/n = I_{d_j\times d_j}$ for $j = 1,\dots,J$. In random designs, this corresponds to
$$X_{G_j} = \widetilde{X}_{G_j}\,\widetilde{\Sigma}_{G_j,G_j}^{-1/2}, \qquad \widetilde{\Sigma} = \widetilde{X}^\top\widetilde{X}/n, \quad j = 1,\dots,J, \qquad (4.32)$$
where $\widetilde{X}\in\mathbb{R}^{n\times p}$ is the original design matrix before normalization, composed of independent rows of observed covariates $\widetilde{X}_{i,*} = (\widetilde{X}_{ij},\ j = 1,\dots,p)$ from the $i$-th data point. Further, let
$$\Sigma = X^\top X/n \qquad\text{and}\qquad \bar\Sigma = \mathbb{E}\,\widetilde{\Sigma} \qquad (4.33)$$
be the normalized sample Gram matrix and the original population Gram matrix, respectively. Moreover, let $\widetilde{D}$ and $\bar D$ be the block diagonal matrices of $\widetilde{\Sigma}$ and $\bar\Sigma$, respectively,
$$\widetilde{D}_{G_j,G_j} = \widetilde{\Sigma}_{G_j,G_j}, \qquad \bar D_{G_j,G_j} = \bar\Sigma_{G_j,G_j}, \quad j = 1,\dots,J.$$
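As an aside, the groupwise normalization (4.32) is easy to carry out in practice. The following is a minimal R sketch of this step (the object names `Xtilde` and `groups` are hypothetical, and the inverse square root of each block of $\widetilde\Sigma$ is computed from an eigendecomposition); it is offered only as an illustration, not as code from the thesis.

```r
# Minimal sketch of the groupwise normalization (4.32), assuming a raw design
# matrix Xtilde (n x p) and a list `groups` whose j-th element holds the column
# indices G_j.  Each block is rescaled so that t(X[, Gj]) %*% X[, Gj] / n = I.
normalize_groups <- function(Xtilde, groups) {
  n <- nrow(Xtilde)
  X <- Xtilde
  for (Gj in groups) {
    Sj <- crossprod(Xtilde[, Gj, drop = FALSE]) / n          # Sigma-tilde_{Gj,Gj}
    eig <- eigen(Sj, symmetric = TRUE)
    Sj_inv_sqrt <- eig$vectors %*% diag(1 / sqrt(eig$values),
                                        length(eig$values)) %*% t(eig$vectors)
    X[, Gj] <- Xtilde[, Gj, drop = FALSE] %*% Sj_inv_sqrt     # Xtilde_{Gj} Sigma^{-1/2}
  }
  X
}
```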
The main reason for using the normalized $X$ is the explicitness in the choice of the corresponding penalty level $\lambda_j$ in (4.2). For example, by Huang and Zhang [33], $\lambda_j$ can be taken as $(\sqrt{d_j}+\sqrt{2\log J})A_0\sigma/\sqrt{n}$ for a certain $A_0>1$ when the noise level $\sigma$ is known. Finally, we define two events related to the original sample Gram matrix. For any $\varepsilon>0$, let
$$\Omega_S = \big\{ \phi_{\max}(\widetilde{D}_{G_j,G_j}) \le 1+\varepsilon,\ \forall j\in S \big\}, \qquad
\overline\Omega_S = \Big\{ \sum_{j\in S}\lambda_j^2\phi_{\max}(\widetilde{D}_{G_j,G_j}) \le (1+\varepsilon)\sum_{j\in S}\lambda_j^2 \Big\}. \qquad (4.34)$$
Theorem 4.2. Suppose $\bar\Sigma_{G_j,G_j} = I_{d_j\times d_j}$, $j = 1,\dots,J$. Let $\varepsilon>0$, $\xi>0$, $\lambda$, $\lambda^*$, $g$, $s$, $d^*_S$ and $d^*_{S^c}$ be as in Theorem 4.1, and let $\Omega_S$ and $\overline\Omega_S$ be as in (4.34). Let $\kappa^2_{(G)}(\bar\Sigma;S,\cdot,\lambda)$ be as in (4.15), and let $\mathscr{C}_{(G)}(\cdot,S,\lambda)$ and $\mathscr{C}^*_{(G)}(\cdot,S,\lambda)$ be as in (4.4) and (4.17), respectively. For any $L>0$, define $k^* = CL\xi^2(d^*_{S^c}+s+g\log J)$ and $s^* = \max\{\sum_{j\in A}d_j : A\subseteq S^c,\ \sum_{j\in A}(d_j^{1/2}+\sqrt{2\log J})^2\le k^*\}$.

(i) Suppose
$$L \ge \big[\varepsilon\,\mathrm{CC}^2_{(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big)\big]^{-1}. \qquad (4.35)$$
Suppose that the following class of variables is uniformly integrable:
$$\Big\{ (\widetilde{X}_{i,*}u)^2 : u\in\mathscr{C}_{(G)}\big((1+\varepsilon)\xi,S,\lambda\big),\ \|u_{S^c}\|_0\le s^*,\ u^\top\bar\Sigma u = 1,\ \forall i \Big\}. \qquad (4.36)$$
Then, if $n\ge Cs^*\log\{ep/s^*\}$,
$$\mathbb{P}\Big\{ \mathrm{CC}_{(G)}(\Sigma;S,\xi,\lambda) \ge (1-3\varepsilon)^{1/2}\,\mathrm{CC}_{(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big) \Big\} \to 1-\mathbb{P}(\Omega_S^c), \qquad (4.37)$$
where $C$ is a constant depending on $\varepsilon$ only.

(ii) Suppose $L\ge\big[\varepsilon\,\kappa^2_{(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big)\big]^{-1}$ and that
$$\Big\{ (\widetilde{X}_{i,*}u)^2 : u\in\mathscr{C}^*_{(G)}\big((1+\varepsilon)\xi,S,\lambda\big),\ \|u_{S^c}\|_0\le s^*,\ u^\top\bar\Sigma u = 1,\ \forall i \Big\} \qquad (4.38)$$
is uniformly integrable. Then, if $n\ge Cs^*\log\{ep/s^*\}$,
$$\mathbb{P}\Big\{ \mathrm{CC}_{(G)}(\Sigma;S,\xi,\lambda) \ge (1-3\varepsilon)^{1/2}\,\kappa_{(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big) \Big\} \to 1-\mathbb{P}(\overline\Omega_S^c). \qquad (4.39)$$
Remark 4.2. The assumption that $\bar\Sigma_{G_j,G_j} = I_{d_j\times d_j}$ does not lose any generality, since replacing $\widetilde{X}_{i,*}$ by $\widetilde{X}_{i,*}\bar D^{-1/2}$ yields the same $X$ in (4.32).

In Theorem 4.2, we proved that the population version of the groupwise CC condition implies its sample version, under a second moment assumption on the linear combinations of the design variables, with probability $1-\mathbb{P}(\Omega_S^c)$. Also, under the same type of second moment assumption, the groupwise CC condition can be guaranteed given a restricted strong convexity condition, with probability $1-\mathbb{P}(\overline\Omega_S^c)$.

The next question is how large $\mathbb{P}(\Omega_S^c)$ and $\mathbb{P}(\overline\Omega_S^c)$ can be. The following theorem guarantees that both $\mathbb{P}(\Omega_S^c)$ and $\mathbb{P}(\overline\Omega_S^c)$ go to zero under a fourth moment assumption on the individual design variables and an $m$-th moment assumption, $m>2$, on the linear combinations of the design variables within each group $j\in S$. Moreover, the fourth and $m$-th moment assumptions can be removed given a slightly larger sample size.
Theorem 4.3. Suppose $\bar\Sigma_{G_j,G_j} = I_{d_j\times d_j}$. Let $\Omega_S$, $\overline\Omega_S$, $\lambda$, $g$, $s$ and $d$ be as in Theorem 4.1.

(i) Suppose that, for all $i = 1,\dots,n$ and $j\in S$, $\mathbb{E}\|\widetilde{X}_{i,G_j}\|_2^4$ and
$$\sup\Big\{ \mathbb{E}\,|\widetilde{X}_{i,G_j}u|^{2q} : q = \frac{1}{1-c/\sqrt{s}},\ u\in\mathbb{R}^{d_j},\ \|u\|_2 = 1 \Big\} \qquad (4.40)$$
are bounded, where $c$ is a constant. If $n\ge Cs\,d^*_S(\log s)$, then
$$\mathbb{P}\{\overline\Omega_S\} \ge \mathbb{P}\{\Omega_S\} \to 1. \qquad (4.41)$$

(ii) With no uniform boundedness condition, if $n\ge Cs\,d^*_S(\log s)^2$, then
$$\mathbb{P}\{\overline\Omega_S\} \ge \mathbb{P}\{\Omega_S\} \to 1. \qquad (4.42)$$
Remark 4.3. The sample size requirement $n\ge Cs\,d^*_S(\log s)$ in part (i) is rather minimal. It can usually be dominated by the sample size requirement $n\ge Cs^*\log\{ep/s^*\}$ in Theorem 4.2.
Combining Theorem 4.2 and Theorem 4.3, we conclude that the groupwise CC condition can be guaranteed under (i) a second moment uniform integrability assumption on the linear combinations of the design variables, and (ii) a fourth moment uniform boundedness assumption on the individual design variables together with an $m$-th moment assumption ($m>2$) on the linear combinations of the within-group design variables, given a population CC or RSC condition and a sample size $n\ge C\{s^*\log\{ep/s^*\}\vee s\,d^*_S\log s\}$. This is the usual sample size requirement, as $s\,d^*_S(\log s)$ can usually be dominated by $s^*\log\{ep/s^*\}$. Given a slightly larger sample size $n\ge C\{s^*\log\{ep/s^*\}\vee s\,d^*_S(\log s)^2\}$, the fourth and $m$-th moment assumptions can be further removed.

The ordinary CC is a special case of the groupwise CC. This leads to the following corollary.
Corollary 4.1. Suppose $\mathrm{diag}(\bar\Sigma) = I_{p\times p}$. Let $\varepsilon>0$, $\xi>0$, and let $\Omega_S$ and $\overline\Omega_S$ be the events
$$\Omega_S = \big\{ \widetilde\Sigma_{j,j}\le1+\varepsilon,\ \forall j\in S \big\}, \qquad \overline\Omega_S = \Big\{ \sum_{j\in S}\widetilde\Sigma_{j,j}\le(1+\varepsilon)s \Big\}. \qquad (4.43)$$

(i) Suppose $L\ge\big[\varepsilon\,\mathrm{CC}^2\big(\bar\Sigma;S,(1+\varepsilon)\xi\big)\big]^{-1}$ and that
$$\Big\{ (\widetilde{X}_{i,*}u)^2 : \|u_{S^c}\|_1\le(1+\varepsilon)\xi\|u_S\|_1,\ \|u_{S^c}\|_0\le s^*,\ u^\top\bar\Sigma u = 1,\ \forall i \Big\} \qquad (4.44)$$
is uniformly integrable, with $s^* = \lceil L\xi^2s\rceil$. Then, if $n\ge Cs^*\log\{ep/s^*\}$,
$$\mathbb{P}\Big\{ \mathrm{CC}(\Sigma;S,\xi)\ge(1-3\varepsilon)^{1/2}\,\mathrm{CC}\big(\bar\Sigma;S,(1+\varepsilon)\xi\big) \Big\} \to 1-\mathbb{P}(\Omega_S^c). \qquad (4.45)$$

(ii) Suppose $L\ge\big[\varepsilon\,\kappa^2\big(\bar\Sigma;S,(1+\varepsilon)\xi\big)\big]^{-1}$ and that
$$\Big\{ (\widetilde{X}_{i,*}u)^2 : \|u_{S^c}\|_1\le(1+\varepsilon)\xi\sqrt{s}\|u_S\|_2,\ \|u_{S^c}\|_0\le s^*,\ u^\top\bar\Sigma u = 1,\ \forall i \Big\}$$
is uniformly integrable, with $s^* = \lceil L\xi^2s\rceil$. If $n\ge Cs^*\log\{ep/s^*\}$, then
$$\mathbb{P}\Big\{ \mathrm{CC}(\Sigma;S,\xi)\ge(1-3\varepsilon)^{1/2}\,\kappa\big(\bar\Sigma;S,(1+\varepsilon)\xi\big) \Big\} \to 1-\mathbb{P}(\overline\Omega_S^c). \qquad (4.46)$$

(iii) Moreover, with no further assumption,
$$\mathbb{P}(\overline\Omega_S)\to1. \qquad (4.47)$$
If $\mathbb{E}\widetilde{X}_{ij}^4$ is bounded for all $1\le i\le n$ and $j\in S$, then
$$\mathbb{P}(\Omega_S)\to1. \qquad (4.48)$$
Remark 4.4. One notable difference between the ordinary CC condition and the groupwise version is that the fourth moment assumption on the individual design variables can be removed given the restricted strong convexity condition on the population, under the usual sample size requirement. The proof of (4.47) is straightforward by the weak law of large numbers. Corollary 4.1 is close to Theorem 5.3 of van de Geer and Muro [76]; the difference is that they require an $m>2$ order isotropy condition instead of the second moment uniform integrability condition.
Proof of Theorem 4.2. We first prove (i). Let $v\in\mathscr{C}_{(G),s^*}(\xi,S,\lambda)$ and suppose $\sum_{j\in S}\lambda_j\|v_{G_j}\|_2 = (\sum_{j\in S}\lambda_j^2)^{1/2}$. When $L$ satisfies (4.35), taking the infimum over the cone $\mathscr{C}_{(G)}(\xi,S,\lambda)$ on both sides of (4.21) gives
$$\mathrm{CC}^2_{(G)}(\Sigma;S,\xi,\lambda) \ge \inf\Big\{ v^\top\Sigma v : v\in\mathscr{C}_{(G),s^*}(\xi,S,\lambda),\ \sum_{j\in S}\lambda_j\|v_{G_j}\|_2 = \Big(\sum_{j\in S}\lambda_j^2\Big)^{1/2} \Big\} - \varepsilon\,\mathrm{CC}^2_{(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big). \qquad (4.49)$$
Now consider the right-hand side of (4.49). Let $\widetilde{v} = \widetilde{D}^{-1/2}v$. Denote the event
$$\Omega_{S^c} = \Big\{ \phi_{\min}(\widetilde{D}_{G_j,G_j}) \ge \frac{1}{1+\varepsilon},\ \forall j\in S^c \Big\}. \qquad (4.50)$$
In the event $\Omega = \Omega_S\cap\Omega_{S^c}$, we have
$$\sum_{j\in S}\lambda_j\|\widetilde{v}_{G_j}\|_2 \ge \frac{\sum_{j\in S}\lambda_j\|v_{G_j}\|_2}{\max_{j\in S}\phi^{1/2}_{\max}(\widetilde{D}_{G_j,G_j})} = \frac{(\sum_{j\in S}\lambda_j^2)^{1/2}}{\max_{j\in S}\phi^{1/2}_{\max}(\widetilde{D}_{G_j,G_j})} \ge \frac{(\sum_{j\in S}\lambda_j^2)^{1/2}}{\sqrt{1+\varepsilon}}, \qquad (4.51)$$
and
$$\sum_{j\in S^c}\lambda_j\|\widetilde{v}_{G_j}\|_2 \le \frac{\sum_{j\in S^c}\lambda_j\|v_{G_j}\|_2}{\min_{j\in S^c}\phi^{1/2}_{\min}(\widetilde{D}_{G_j,G_j})} \le \frac{\xi\sum_{j\in S}\lambda_j\|v_{G_j}\|_2}{\min_{j\in S^c}\phi^{1/2}_{\min}(\widetilde{D}_{G_j,G_j})} \le (1+\varepsilon)\xi\sum_{j\in S}\lambda_j\|\widetilde{v}_{G_j}\|_2. \qquad (4.52)$$
Thus, with $\xi' = (1+\varepsilon)\xi$, we have
$$\begin{aligned}
&(1+\varepsilon)\inf\Big\{ v^\top\Sigma v : v\in\mathscr{C}_{(G),s^*}(\xi,S,\lambda),\ \sum_{j\in S}\lambda_j\|v_{G_j}\|_2 = \Big(\sum_{j\in S}\lambda_j^2\Big)^{1/2} \Big\} \\
&\quad\ge (1+\varepsilon)\inf\Big\{ \frac{\|\widetilde{X}\widetilde{v}\|_2^2}{n} : \sum_{j\in S}\lambda_j\|\widetilde{v}_{G_j}\|_2 \ge \frac{(\sum_{j\in S}\lambda_j^2)^{1/2}}{\sqrt{1+\varepsilon}},\ \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \|\widetilde{v}_{S^c}\|_0\le s^* \Big\} \\
&\quad\ge \inf\Big\{ \frac{\|\widetilde{X}\widetilde{v}\|_2^2}{n} : \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \sum_{j\in S}\lambda_j\|\widetilde{v}_{G_j}\|_2 = \Big(\sum_{j\in S}\lambda_j^2\Big)^{1/2},\ \|\widetilde{v}_{S^c}\|_0\le s^* \Big\} \\
&\quad= \inf\Big\{ \frac{\widetilde{v}^\top\bar\Sigma\widetilde{v}\,\sum_{j\in S}\lambda_j^2}{\big(\sum_{j\in S}\lambda_j\|\widetilde{v}_{G_j}\|_2\big)^2}\times\frac{\|\widetilde{X}\widetilde{v}\|_2^2}{n} : \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}^\top\bar\Sigma\widetilde{v} = 1 \Big\} \\
&\quad\ge \mathrm{CC}^2_{(G)}(\bar\Sigma;S,\xi',\lambda)\,\inf\Big\{ \frac{1}{n}\sum_{i=1}^n(\widetilde{X}_{i,*}\widetilde{v})^2 : \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}^\top\bar\Sigma\widetilde{v} = 1 \Big\}.
\end{aligned}$$
Moreover, by Lemma 4.4, when $n\ge Cs^*\log\{ep/s^*\}$,
$$\mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n(\widetilde{X}_{i,*}\widetilde{v})^2 : \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}^\top\bar\Sigma\widetilde{v} = 1 \Big\} \ge 1-\varepsilon \Big\} \to 1.$$
Combining with (4.49), we have that
$$\mathbb{P}\Big\{ \mathrm{CC}^2_{(G)}(\Sigma;S,\xi,\lambda)\ge(1-3\varepsilon)\,\mathrm{CC}^2_{(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big) \Big\}
\ge \mathbb{P}\Big\{ \mathrm{CC}^2_{(G)}(\Sigma;S,\xi,\lambda)\ge\Big(\frac{1-\varepsilon}{1+\varepsilon}-\varepsilon\Big)\mathrm{CC}^2_{(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big) \Big\} \to 1.$$
It remains to control $\mathbb{P}\{\Omega_{S^c}\}$; by Lemma 4.4,
$$\mathbb{P}\{\Omega_{S^c}\} \ge \min_{j\in S^c}\mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n(\widetilde{X}_{i,*}u)^2 : \mathrm{supp}(u)\subseteq G_j,\ \|u\|_2 = 1 \Big\} \ge \frac{1}{1+\varepsilon} \Big\} \to 1. \qquad (4.53)$$
Then $\mathbb{P}\{\Omega_{S^c}\}\to1$ and (4.37) holds. To prove (4.39), note that in the event $\Omega = \overline\Omega_S\cap\Omega_{S^c}$,
$$\Big(\sum_{j\in S}\lambda_j^2\Big)^{1/2} = \sum_{j\in S}\lambda_j\big\|\widetilde{D}^{1/2}_{G_j,G_j}\widetilde{v}_{G_j}\big\|_2
\le \sum_{j\in S}\lambda_j\phi^{1/2}_{\max}(\widetilde{D}_{G_j,G_j})\|\widetilde{v}_{G_j}\|_2
\le (1+\varepsilon)^{1/2}\Big(\sum_{j\in S}\lambda_j^2\Big)^{1/2}\|\widetilde{v}_{G_S}\|_2,$$
so that
$$\|\widetilde{v}_{G_S}\|_2 \ge \frac{1}{\sqrt{1+\varepsilon}}, \qquad (4.54)$$
and
$$\sum_{j\in S^c}\lambda_j\|\widetilde{v}_{G_j}\|_2 \le \frac{\sum_{j\in S^c}\lambda_j\|v_{G_j}\|_2}{\min_{j\in S^c}\phi^{1/2}_{\min}(\widetilde{D}_{G_j,G_j})}
\le \frac{\xi\sum_{j\in S}\lambda_j\|v_{G_j}\|_2}{\min_{j\in S^c}\phi^{1/2}_{\min}(\widetilde{D}_{G_j,G_j})}
\le (1+\varepsilon)\xi\Big(\sum_{j\in S}\lambda_j^2\Big)^{1/2}\|\widetilde{v}_{G_S}\|_2. \qquad (4.55)$$
The remaining proof follows in the same way. $\square$
Proof of Theorem 4.3. It is easy to see that $\mathbb{P}\{\overline\Omega_S\}\ge\mathbb{P}\{\Omega_S\}$, so we only need to prove $\mathbb{P}\{\Omega_S\} = \mathbb{P}\{\phi_{\max}(\widetilde\Sigma_{G_j,G_j})\le1+\varepsilon,\ \forall j\in S\}\to1$ for any $\varepsilon>0$. We first truncate $\widetilde{X}_{i,G_j}$ as follows:
$$\widetilde{X}_{i,G_j} = \widetilde{X}_{i,G_j}I\{\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\} + \widetilde{X}_{i,G_j}I\{\|\widetilde{X}_{i,G_j}\|_2^2>a_n\},$$
where $a_n$ will be chosen differently for parts (i) and (ii). Further, for any $j\in S$, let
$$M_i = \widetilde{X}_{i,G_j}\widetilde{X}_{i,G_j}^\top I\{\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\} - \mathbb{E}\,\widetilde{X}_{i,G_j}\widetilde{X}_{i,G_j}^\top I\{\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\}.$$
To prove (i), let $a_n = n^{\varepsilon_0}$ for a certain $\varepsilon_0>0$. We have that
$$\begin{aligned}
\big\|\mathbb{E}M_iM_i^\top\big\|_2
&= \big\|\mathbb{E}\,\widetilde{X}_{i,G_j}\widetilde{X}_{i,G_j}^\top\widetilde{X}_{i,G_j}\widetilde{X}_{i,G_j}^\top I\{\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\}\big\|_2 \\
&\le \max_{\|u\|_2=1}\mathbb{E}\,|\widetilde{X}_{i,G_j}u|^2\,\|\widetilde{X}_{i,G_j}\|_2^2\,I\big(\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\big) \\
&\le \max_{\|u\|_2=1}\big(\mathbb{E}\,|\widetilde{X}_{i,G_j}u|^{2q}\big)^{1/q}\big(\mathbb{E}\,\|\widetilde{X}_{i,G_j}\|_2^4\big)^{1-1/q}a_n^{1-2(1-1/q)}
\le Cd_j^{2(1-1/q)}a_n^{1-2(1-1/q)} \qquad (4.56)
\end{aligned}$$
holds for some constant $C$, where the second inequality holds by Hölder's inequality and the last by the uniform boundedness of $\sup_{u,i,j}\mathbb{E}|\widetilde{X}_{i,G_j}u|^{2q}$ and $\mathbb{E}\|\widetilde{X}_{i,G_j}\|_2^4$. Let $\sigma_j^2 = Cd_j^{2(1-1/q)}a_n^{1-2(1-1/q)}$ and $h(x) = (1+x)\log(1+x)-x$. By the Bennett inequality in Tropp [73],
$$\begin{aligned}
&\mathbb{P}\Big\{ \big\|\widetilde\Sigma_{G_j,G_j}I\{\max_{1\le i\le n}\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\} - \mathbb{E}\,\widetilde\Sigma_{G_j,G_j}I\{\max_{1\le i\le n}\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\}\big\|_2 > \varepsilon \Big\} \\
&\quad\le d_j\exp\Big\{-\frac{n\sigma_j^2}{a_n^2}\,h\Big(\frac{a_n\varepsilon}{\sigma_j^2}\Big)\Big\}
\le d_j\exp\Big\{-\frac{n\sigma_j^2}{a_n^2}\cdot\frac{a_n\varepsilon}{\sigma_j^2}\,\log\Big(\frac{1}{e}+\frac{a_n\varepsilon}{e\sigma_j^2}\Big)\Big\} \\
&\quad\le Cd_j\exp\Big\{-\frac{n\varepsilon}{a_n}\,\log\Big(\frac{1}{e}+\frac{\varepsilon\,a_n^{2(1-1/q)}}{e\,d_j^{2(1-1/q)}}\Big)\Big\}
\le Cd_j\exp\Big\{-2\Big(1-\frac{1}{q}\Big)\frac{\varepsilon}{\varepsilon_0}\,\log\Big(\frac{a_n}{d_j}\Big)\Big\}
\le \frac{Cd_j^{\gamma+1}}{n^\gamma}\Big[\frac{\gamma}{2(1-1/q)\varepsilon}\Big]^\gamma, \qquad (4.57)
\end{aligned}$$
where $\gamma = 2(1-1/q)\varepsilon/\varepsilon_0$ and $C$ is a constant that may vary between inequalities. The second inequality holds because $h(x)\ge x\log(1/e+x/e)$. Fixing $\gamma = 1$ and $1-1/q = c/\sqrt{\log s}$, we have that, when $n\ge C(\log s)\,s\,d^*_S$,
$$\sum_{j\in S}\mathbb{P}\Big\{ \big\|\widetilde\Sigma_{G_j,G_j}I\{\max_{1\le i\le n}\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\} - \mathbb{E}\,\widetilde\Sigma_{G_j,G_j}I\{\max_{1\le i\le n}\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\}\big\|_2 > \varepsilon \Big\}
\le \frac{C\sqrt{\log s}\,\sum_{j\in S}d_j^2}{\varepsilon n}\to0. \qquad (4.58)$$
Moreover, the Markov inequality gives
$$\mathbb{P}\Big(\max_{1\le i\le n,\,j\in S}\|\widetilde{X}_{i,G_j}\|_2^2>a_n\Big)
\le \frac{n}{a_n^2}\sum_{j\in S}\mathbb{E}\,\|\widetilde{X}_{i,G_j}\|_2^4\,I\big(\|\widetilde{X}_{i,G_j}\|_2^2>a_n\big)
\le \frac{C\log s\,\sum_{j\in S}d_j^2}{n}\to0. \qquad (4.59)$$
Combining the above two inequalities, (4.41) holds.

To prove (ii), let $a_n = \varepsilon_0 n/\log s$ for some $\varepsilon_0>0$. We have
$$\big\|\mathbb{E}M_iM_i^\top\big\|_2
\le \big\|\mathbb{E}\,\widetilde{X}_{i,G_j}\widetilde{X}_{i,G_j}^\top\widetilde{X}_{i,G_j}\widetilde{X}_{i,G_j}^\top I\{\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\}\big\|_2
\le \frac{\varepsilon_0 n}{\log s}\big\|\mathbb{E}\,\widetilde{X}_{i,G_j}\widetilde{X}_{i,G_j}^\top I\{\|\widetilde{X}_{i,G_j}\|_2^2\le\varepsilon_0 n/\log s\}\big\|_2
\le \frac{\varepsilon_0 n}{\log s}.$$
Let $\sigma_j^2 = \varepsilon_0 n/\log s$, and choose $\varepsilon_0$ such that $2\sqrt{\varepsilon_0}+(8/3)\varepsilon_0 = \varepsilon$. Then, by the Bernstein inequality in Tropp [73],
$$\begin{aligned}
&\mathbb{P}\Big\{ \big\|\widetilde\Sigma_{G_j,G_j}I\{\max_{1\le i\le n}\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\} - \mathbb{E}\,\widetilde\Sigma_{G_j,G_j}I\{\max_{1\le i\le n}\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\}\big\|_2 > \varepsilon \Big\} \\
&\quad\le Cd_j\exp\Big(\frac{-n\varepsilon^2/2}{\varepsilon_0 n/\log s + 2n\varepsilon_0\varepsilon/(3\log s)}\Big)
= Cd_j\exp\Big(-(2\log s)\,\frac{\big(\sqrt{\varepsilon_0}+(4/3)\varepsilon_0\big)^2}{\varepsilon_0+(2/3)\varepsilon_0\big(2\sqrt{\varepsilon_0}+(8/3)\varepsilon_0\big)}\Big)
\le \frac{Cd_j}{s^2}. \qquad (4.60)
\end{aligned}$$
Summing the above inequality over $j\in S$, we have
$$\sum_{j\in S}\mathbb{P}\Big\{ \big\|\widetilde\Sigma_{G_j,G_j}I\{\max_{1\le i\le n}\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\} - \mathbb{E}\,\widetilde\Sigma_{G_j,G_j}I\{\max_{1\le i\le n}\|\widetilde{X}_{i,G_j}\|_2^2\le a_n\}\big\|_2 > \varepsilon \Big\}
\le C\sum_{j\in S}\frac{d_j}{s^2} = Cs^{-1}\to0. \qquad (4.61)$$
Moreover, by the Markov inequality, when $n\ge C(\log s)^2\,s\,d^*_S$,
$$\mathbb{P}\Big(\max_{i,\,j\in S}\|\widetilde{X}_{i,G_j}\|_2^2>\varepsilon_0 n/\log s\Big)
\le \frac{n}{(\varepsilon_0 n/\log s)^2}\sum_{j\in S}\mathbb{E}\,\|\widetilde{X}_{i,G_j}\|_2^4\,I\big(\|\widetilde{X}_{i,G_j}\|_2^2>\varepsilon_0 n/\log s\big)
\le \frac{C(\log s)^2\sum_{j\in S}d_j^2}{n}\to0.$$
Combining with (4.61), (4.42) holds. $\square$
4.5 Groupwise restricted eigenvalue condition
In this section, we prove that the groupwise restricted eigenvalue condition, which is sufficient to control the $\ell_2$ estimation error, can be guaranteed under low moment conditions in the same way as the groupwise CC in Section 4.4.

Recall that the design matrix $X$ in the linear model (4.1) is normalized such that $X_{G_j}^\top X_{G_j}/n = I_{d_j\times d_j}$. Let $\beta^{(\mathrm{orig})}$ be the original regression coefficient vector in the linear model
$$y = \widetilde{X}\beta^{(\mathrm{orig})} + \varepsilon \qquad (4.62)$$
before normalization of the design matrix. As $\beta^{(\mathrm{orig})}$ is the parameter of interest, it is natural to use the loss $\widehat\beta^{(\mathrm{orig})}-\beta^{(\mathrm{orig})}$, where $\widehat\beta^{(\mathrm{orig})}$ is the estimate of $\beta^{(\mathrm{orig})}$. If we estimate $\beta^{(\mathrm{orig})}$ by $\widehat\beta^{(\mathrm{orig})} = \widetilde{D}^{-1/2}\widehat\beta^{(G)}$, the loss for $\widehat\beta^{(\mathrm{orig})}$ can be written as
$$\widehat\beta^{(\mathrm{orig})} - \beta^{(\mathrm{orig})} = \widetilde{D}^{-1/2}\big(\widehat\beta^{(G)}-\beta^*\big).$$
This leads to the restricted eigenvalue for the $\ell_2$ loss:
$$\mathrm{RE}^2_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda) = \inf\Big\{ \frac{u^\top\Sigma u}{\|\widetilde{D}^{-1/2}u\|_2^2} : u\in\mathscr{C}_{(G)}(\xi,S,\lambda) \Big\}. \qquad (4.63)$$
Theorem 4.4. Suppose $\bar\Sigma_{G_j,G_j} = I_{d_j\times d_j}$. Let $\varepsilon>0$, $\xi>0$, $\lambda$, $\lambda^*$, $g$, $s$, $d^*_S$ and $d^*_{S^c}$ be as in Theorem 4.1, and let $\Omega_S$ and $\overline\Omega_S$ be as in (4.34). Let $\kappa^2_{*(G)}(\bar\Sigma;S,\cdot,\lambda)$ be as in (4.16), and let $\mathscr{C}_{(G)}(\cdot,S,\lambda)$ and $\mathscr{C}^*_{(G)}(\cdot,S,\lambda)$ be as in (4.4) and (4.17), respectively. For any $L>0$, define $k^*$ and $s^*$ as in Theorem 4.2.

(i) Suppose
$$L \ge \frac{C\min\big\{1,\ (\sum_{j\in S}\lambda_j^2)^{1/2}/\lambda^*\big\}}{\varepsilon\,\mathrm{RE}^2_{(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big)}. \qquad (4.64)$$
Suppose that the following class of variables is uniformly integrable:
$$\Big\{ (\widetilde{X}_{i,*}u)^2 : u\in\mathscr{C}_{(G)}\big((1+\varepsilon)\xi,S,\lambda\big),\ \|u_{S^c}\|_0\le s^*,\ u^\top\bar\Sigma u = 1,\ \forall i \Big\}. \qquad (4.65)$$
If $n\ge Cs^*\log\{ep/s^*\}$, then
$$\mathbb{P}\Big\{ \mathrm{RE}_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)\ge(1-4\varepsilon)^{1/2}\,\mathrm{RE}_{*(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big) \Big\} \to 1-\mathbb{P}(\Omega_S^c), \qquad (4.66)$$
where $C$ is a constant depending on $\varepsilon$ only.

(ii) Suppose
$$L \ge \frac{C\min\big\{1,\ (\sum_{j\in S}\lambda_j^2)^{1/2}/\lambda^*\big\}}{\varepsilon\,\kappa^2_{*(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big)},$$
and suppose that the following class of variables is uniformly integrable:
$$\Big\{ (\widetilde{X}_{i,*}u)^2 : u\in\mathscr{C}^*_{(G)}\big((1+\varepsilon)\xi,S,\lambda\big),\ \|u_{S^c}\|_0\le s^*,\ u^\top\bar\Sigma u = 1,\ \forall i \Big\}.$$
If $n\ge Cs^*\log\{ep/s^*\}$, then
$$\mathbb{P}\Big\{ \mathrm{RE}_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)\ge(1-4\varepsilon)^{1/2}\,\kappa_{*(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big) \Big\} \to 1-\mathbb{P}(\overline\Omega_S^c). \qquad (4.67)$$

(iii) Let $\xi>1$ and let $\beta$ be a vector with $\mathrm{supp}(\beta)\subseteq G_S = \cup_{j\in S}G_j$. In the event $\Omega_S\cup\overline\Omega_S$, if $\|X_{G_j}^\top(y-X\beta)\|_2/n\le\lambda_j\eta$ holds for all $1\le j\le J$ with $\eta = (\xi-1)/(\xi+1)$, then the group Lasso solution $\widehat\beta^{(G)}$ in (4.2) satisfies
$$\big\|\widetilde{D}^{-1/2}\widehat\beta^{(G)}-\beta^{(\mathrm{orig})}\big\|_2
\le \frac{(1+\eta)\big(\sum_{j\in S}\lambda_j^2\big)^{1/2}}{\mathrm{RE}_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)\,\mathrm{CC}_{(G)}(\Sigma;S,\xi,\lambda)}
\le \frac{(1+\varepsilon)^{1/2}(1+\eta)\big(\sum_{j\in S}\lambda_j^2\big)^{1/2}}{\mathrm{RE}^2_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)}. \qquad (4.68)$$

Since $\mathbb{P}(\Omega_S^c)$ and $\mathbb{P}(\overline\Omega_S^c)$ go to zero under a uniform boundedness condition by Theorem 4.3, Theorem 4.4 shows that the population groupwise RE (or RSC) condition implies the sample RE under a low moment condition, in the same way as the groupwise CC. Moreover, the inequality (4.68) confirms that the $\ell_2$ estimation error of the original coefficients can be bounded via $\mathrm{RE}^2_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)$. The ordinary RE condition can also be viewed as a special case of the groupwise version, as in the following corollary.
Corollary 4.2. Suppose $\mathrm{diag}(\bar\Sigma) = I_{p\times p}$. Let $\varepsilon>0$, $\xi>0$, and let $\Omega_S$ and $\overline\Omega_S$ be the events in (4.43).

(i) Suppose $L\ge C\big[\varepsilon\,\mathrm{RE}^2_*\big(\bar\Sigma;S,(1+\varepsilon)\xi\big)\big]^{-1}$ and that
$$\Big\{ (\widetilde{X}_{i,*}u)^2 : \|u_{S^c}\|_1\le(1+\varepsilon)\xi\|u_S\|_1,\ \|u_{S^c}\|_0\le s^*,\ u^\top\bar\Sigma u = 1,\ \forall i \Big\} \qquad (4.69)$$
is uniformly integrable, with $s^* = \lceil L\xi^2s\rceil$. If $n\ge Cs^*\log\{ep/s^*\}$, then
$$\mathbb{P}\Big\{ \mathrm{RE}_*(\Sigma,\widetilde{D};S,\xi)\ge(1-4\varepsilon)^{1/2}\,\mathrm{RE}_*\big(\bar\Sigma;S,(1+\varepsilon)\xi\big) \Big\} \to 1-\mathbb{P}(\Omega_S^c). \qquad (4.70)$$

(ii) Suppose $L\ge C\big[\varepsilon\,\kappa^2_*\big(\bar\Sigma;S,(1+\varepsilon)\xi\big)\big]^{-1}$ and that
$$\Big\{ (\widetilde{X}_{i,*}u)^2 : \|u_{S^c}\|_1\le(1+\varepsilon)\xi\sqrt{s}\|u_S\|_2,\ \|u_{S^c}\|_0\le s^*,\ u^\top\bar\Sigma u = 1,\ \forall i \Big\}$$
is uniformly integrable, with $s^* = \lceil L\xi^2s\rceil$. If $n\ge Cs^*\log\{ep/s^*\}$, then
$$\mathbb{P}\Big\{ \mathrm{RE}_*(\Sigma,\widetilde{D};S,\xi)\ge(1-4\varepsilon)^{1/2}\,\kappa_*\big(\bar\Sigma;S,(1+\varepsilon)\xi\big) \Big\} \to 1-\mathbb{P}(\overline\Omega_S^c). \qquad (4.71)$$
Existing results on RE-type conditions are limited to the CC or the RE in (4.9), which can only guarantee the prediction and $\ell_1$ estimation error of the Lasso. To our knowledge, Corollary 4.2 is the first result to guarantee the RE condition for the $\ell_2$ estimation error of the Lasso under low moment conditions. Proving the RE condition is also considerably more difficult than proving the CC: the whole vector $u$ needs to be controlled in order to bound the RE, while for the CC condition one only needs to control the nonzero part of $u$.
Proof of Theorem 4.4. Consider $u\in\mathscr{C}_{(G)}(\xi,S,\lambda)$ and suppose $\sum_{j\in S}\lambda_j\|u_{G_j}\|_2 = (\sum_{j\in S}\lambda_j^2)^{1/2}$. In the event $\Omega = \Omega_S\cap\Omega_{S^c}$, when $L$ satisfies (4.64), taking the infimum over the cone $\mathscr{C}_{(G)}(\xi,S,\lambda)$ on both sides of (4.22) with $\varepsilon_1 = \varepsilon$ gives
$$\mathrm{RE}^2_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda) \ge (1-\varepsilon)\inf\Big\{ \frac{v^\top\Sigma v}{\|\widetilde{D}^{-1/2}v\|_2^2} : v\in\mathscr{C}_{(G),s^*}(\xi,S,\lambda) \Big\} - \varepsilon\,\mathrm{RE}^2_{*(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big). \qquad (4.72)$$
Let $\widetilde{v} = \widetilde{D}^{-1/2}v$. Recalling (4.51) and (4.52), we have
$$\sum_{j\in S}\lambda_j\|\widetilde{v}_{G_j}\|_2 \ge \frac{(\sum_{j\in S}\lambda_j^2)^{1/2}}{\sqrt{1+\varepsilon}}, \qquad
\sum_{j\in S^c}\lambda_j\|\widetilde{v}_{G_j}\|_2 \le (1+\varepsilon)\xi\sum_{j\in S}\lambda_j\|\widetilde{v}_{G_j}\|_2.$$
Thus, with $\xi' = (1+\varepsilon)\xi$, we have
$$\begin{aligned}
(1+\varepsilon)\inf\Big\{ \frac{v^\top\Sigma v}{\|\widetilde{D}^{-1/2}v\|_2^2} : v\in\mathscr{C}_{(G),s^*}(\xi,S,\lambda) \Big\}
&\ge (1+\varepsilon)\inf\Big\{ \frac{\|\widetilde{X}\widetilde{v}\|_2^2}{n\|\widetilde{v}\|_2^2} : \sum_{j\in S}\lambda_j\|\widetilde{v}_{G_j}\|_2\ge\frac{(\sum_{j\in S}\lambda_j^2)^{1/2}}{\sqrt{1+\varepsilon}},\ \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda) \Big\} \\
&\ge \mathrm{RE}^2_{*(G)}(\bar\Sigma;S,\xi',\lambda)\,\inf\Big\{ \frac{\|\widetilde{X}\widetilde{v}\|_2^2}{n} : \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \widetilde{v}^\top\bar\Sigma\widetilde{v} = 1 \Big\}.
\end{aligned}$$
By Lemma 4.4,
$$\inf\Big\{ \frac{\|\widetilde{X}\widetilde{v}\|_2^2}{n} : \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \widetilde{v}^\top\bar\Sigma\widetilde{v} = 1 \Big\} \ge 1-\varepsilon$$
holds with probability tending to one. When $L$ satisfies (4.64), it follows from (4.72) that
$$\mathrm{RE}^2_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)
\ge \frac{(1-\varepsilon)^2}{1+\varepsilon}\,\mathrm{RE}^2_{*(G)}\big(\bar\Sigma;S,\xi',\lambda\big) - \varepsilon\,\mathrm{RE}^2_{*(G)}\big(\bar\Sigma;S,(1+\varepsilon)\xi,\lambda\big)
\ge (1-4\varepsilon)\,\mathrm{RE}^2_{*(G)}\big(\bar\Sigma;S,\xi',\lambda\big).$$
Also, the upper bound on $\mathbb{P}(\Omega_{S^c}^c)$ has been controlled in (4.53). The further bound in the event $\Omega = \overline\Omega_S\cap\Omega_{S^c}$ follows in the same way.

To prove the estimation error bound (4.68) for the original coefficients, we start with the basic inequality. Let $h = \widehat\beta^{(G)}-\beta$; by Mitra and Zhang [52],
$$h^\top\Sigma h \le (1+\eta)\sum_{j\in S}\lambda_j\|h_{G_j}\|_2 - (1-\eta)\sum_{j\in S^c}\lambda_j\|h_{G_j}\|_2.$$
Setting $h = \lambda tu$ and optimizing over $t$, we have
$$\big\|\widetilde{D}^{-1/2}\widehat\beta^{(G)}-\beta^{(\mathrm{orig})}\big\|_2 \le \lambda\,C_{\mathrm{est},\ell_2}(\widetilde{D};S,\eta),$$
where $C_{\mathrm{est},\ell_2} = \sup\big\{ \|\widetilde{D}^{-1/2}u\|_2\big[(1+\eta)\sum_{j\in S}\lambda_j\|u_{G_j}\|_2-(1-\eta)\sum_{j\in S^c}\lambda_j\|u_{G_j}\|_2\big]_+ : u^\top\Sigma u = 1 \big\}$. Further,
$$\begin{aligned}
C_{\mathrm{est},\ell_2}
&\le \sup_{0\le t\le\xi}\Big\{ \frac{\|\widetilde{D}^{-1/2}u\|_2\big[(1+\eta)\sum_{j\in S}\lambda_j\|u_{G_j}\|_2-(1-\eta)\sum_{j\in S^c}\lambda_j\|u_{G_j}\|_2\big]}{u^\top\Sigma u} : \sum_{j\in S^c}\lambda_j\|u_{G_j}\|_2 = t\sum_{j\in S}\lambda_j\|u_{G_j}\|_2 \Big\} \\
&\le \sup_{0\le t\le\xi}\big\{(1+\eta)-t(1-\eta)\big\}\,\frac{\big(\sum_{j\in S}\lambda_j^2\big)^{1/2}}{\mathrm{RE}_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)\,\mathrm{CC}_{(G)}(\Sigma;S,\xi,\lambda)}
\le \frac{(1+\eta)\big(\sum_{j\in S}\lambda_j^2\big)^{1/2}}{\mathrm{RE}_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)\,\mathrm{CC}_{(G)}(\Sigma;S,\xi,\lambda)}. \qquad (4.73)
\end{aligned}$$
The first inequality in (4.68) follows. Moreover, for any $u\in\mathscr{C}_{(G)}(\xi,S,\lambda)$,
$$\frac{u^\top\Sigma u\,\big(\sum_{j\in S}\lambda_j^2\big)}{\big(\sum_{j\in S}\lambda_j\|u_{G_j}\|_2\big)^2}
\ge \frac{u^\top\Sigma u\,\big(\sum_{j\in S}\lambda_j^2\big)}{\big(\sum_{j\in S}\lambda_j\phi^{1/2}_{\max}(\widetilde{D}_{G_j,G_j})\|\widetilde{D}^{-1/2}_{G_j,G_j}u_{G_j}\|_2\big)^2}
\ge \frac{u^\top\Sigma u\,\big(\sum_{j\in S}\lambda_j^2\big)}{\big(\sum_{j\in S}\lambda_j^2\phi_{\max}(\widetilde{D}_{G_j,G_j})\big)\big\|\widetilde{D}^{-1/2}_{G_S,G_S}u_{G_S}\big\|_2^2}
\ge \frac{u^\top\Sigma u}{(1+\varepsilon)\big\|\widetilde{D}^{-1/2}_{G_S,G_S}u_{G_S}\big\|_2^2}
\ge \frac{u^\top\Sigma u}{(1+\varepsilon)\big\|\widetilde{D}^{-1/2}u\big\|_2^2}.$$
Taking the infimum over the cone $\mathscr{C}_{(G)}(\xi,S,\lambda)$ on both sides gives
$$\mathrm{CC}^2_{(G)}(\Sigma;S,\xi,\lambda) \ge \frac{\mathrm{RE}^2_{*(G)}(\Sigma,\widetilde{D};S,\xi,\lambda)}{1+\varepsilon}.$$
Combining with (4.73), the second inequality of (4.68) holds. $\square$
4.6 Convergence of the restricted eigenvalue
In Sections 4.4 and 4.5, we proved that the groupwise CC and RE conditions can be controlled under low moment conditions, given that their population versions hold. The next question is whether the restricted eigenvalue truly converges to its population version. Rudelson and Zhou [66] considered the case of bounded designs and proved the convergence of the RE under a certain sample size requirement. van de Geer and Muro [76] further extended the boundedness assumption to sub-Gaussian designs, but assumed a strong isotropy condition. In this section, we show that van de Geer and Muro's results can be generalized to the groupwise setting easily, with no further conditions.
Definition 4.1. Let $m>2$. A random vector $X_0\in\mathbb{R}^p$ is strongly $m$-th order isotropic with constant $c_m$ if, for all $u\in\mathbb{R}^p$ with $\mathbb{E}(X_0^\top u)^2 = 1$, it holds that
$$\big[\mathbb{E}(X_0^\top u)^m\big]^{1/m} \le c_m.$$

Definition 4.2. A random variable $Z\in\mathbb{R}$ is sub-Gaussian with constant $c$ if, for all $\lambda>0$, it holds that
$$\mathbb{E}\exp[\lambda|Z|] \le 2\exp[\lambda^2c^2/2].$$

Lemma 4.1. Let $\widetilde\Sigma = \widetilde{X}^\top\widetilde{X}/n$ and $\bar\Sigma = \mathbb{E}\widetilde\Sigma$. If $\widetilde{X}_{i,*}$ is strongly $m$-th order isotropic with constant $c_m$ and its components $\widetilde{X}_{i,j}$ are sub-Gaussian with constant $c$, then for a universal constant $c_1$ and all $t>0$, with probability at least $1-(1+n^{-m/(m-2)})\exp[-t]$,
$$\begin{aligned}
\sup_{u^\top\bar\Sigma u = 1,\ \|u\|_1\le M}\big|u^\top\widetilde\Sigma u - u^\top\bar\Sigma u\big|/c_1
&\le cM\sqrt{\frac{\big(2t+2\log(2p)+2m\log n/(m-2)\big)\big(\log p\,\log^3 n+t\big)}{n}} \\
&\quad + c^2M^2\,\frac{\big(2t+2\log(2p)+2m\log n/(m-2)\big)\big(\log p\,\log^3 n+t\big)}{n}
+ c_m^2\exp[-t(m-2)/m]/n. \qquad (4.74)
\end{aligned}$$
This lemma is quoted from van de Geer and Muro [76]. Evidently, to achieve the convergence, the sample size $n$ needs to satisfy the conditions
$$\log^2 p\,\log^3 n \ll n, \qquad (m-2)^{-1}m\,\log^2 p\,\log^3 n \ll n.$$
Now we generalize the results to the groupwise setting.
Theorem 4.5. Let $\widetilde\Sigma = \widetilde{X}^\top\widetilde{X}/n$ and $\bar\Sigma = \mathbb{E}\widetilde\Sigma$. Suppose $\widetilde{X}_{i,*}$ is strongly $m$-th order isotropic with constant $c_m$ and its components $\widetilde{X}_{i,j}$ are sub-Gaussian with constant $c$. Let $A_0$, $\sigma$, $\lambda$ be as in Theorem 4.1. Then, for a universal constant $c_1$ and all $t>0$, with probability at least $1-(1+n^{-m/(m-2)})\exp[-t]$,
$$\begin{aligned}
\sup_{u^\top\bar\Sigma u = 1,\ \sum_{j=1}^J\lambda_j\|u_{G_j}\|_2\le MA_0\sigma/\sqrt{n}}\big|u^\top\widetilde\Sigma u - u^\top\bar\Sigma u\big|/c_1
&\le cM\sqrt{\frac{\big(2t+2\log(2p)+2m\log n/(m-2)\big)\big(\log p\,\log^3 n+t\big)}{n}} \\
&\quad + c^2M^2\,\frac{\big(2t+2\log(2p)+2m\log n/(m-2)\big)\big(\log p\,\log^3 n+t\big)}{n}
+ c_m^2\exp[-t(m-2)/m]/n. \qquad (4.75)
\end{aligned}$$
The proof of Theorem 4.5 is straightforward. When $\lambda_j = (\sqrt{d_j}+\sqrt{2\log J})A_0\sigma/\sqrt{n}$, the constraint $\sum_{j=1}^J\lambda_j\|u_{G_j}\|_2\le MA_0\sigma/\sqrt{n}$ implies that $\sum_{j=1}^Jd_j^{1/2}\|u_{G_j}\|_2\le M$. By the Cauchy-Schwarz inequality,
$$\|u\|_1 \le \sum_{j=1}^Jd_j^{1/2}\|u_{G_j}\|_2 \le M.$$
Then (4.75) follows from (4.74).
4.7 Lemmas
In this section, we provide three lemmas that were used in the preceding proofs.
Lemma 4.2. Let $Z^{(i,k)}$ and $Z^{(k)}$ be as in (4.24) and (4.25), respectively. Let $D$ be a block diagonal matrix such that $\sum_{j\in S}\lambda_j^2\phi_{\max}(D_{G_j,G_j})\le(1+\varepsilon)\sum_{j\in S}\lambda_j^2$ and $\phi_{\min}(D_{G_j,G_j})\ge1/(1+\varepsilon)$ for all $j\in S^c$. Then for any $\varepsilon_1>0$,
$$\mathbb{P}\Big\{ \big\|D^{-1/2}Z^{(1)}\big\|_2^2 \le (1-\varepsilon_1)\big\|D^{-1/2}u\big\|_2^2 \Big\} \le k^*\exp\{-c\varepsilon_1^2L\},$$
where $c$ is a constant that may depend on $\varepsilon$ and $\varepsilon_1$.
Proof of Lemma 4.2. We first note that
$$\big\|D^{-1/2}Z^{(1)}\big\|_2^2 - \big\|D^{-1/2}u\big\|_2^2 = \big(Z^{(1)}+u\big)^\top D^{-1}\big(Z^{(1)}-u\big) \ge 2u^\top D^{-1}\big(Z^{(1)}-u\big).$$
Let
$$\zeta^{(i,k)} = \frac{u^\top D^{-1}\big(Z^{(i,k)}-\sum_{j\in J_k}U^j\big)}{\big\|D^{-1/2}u\big\|_2^2}, \qquad k = 1,\dots,k^*,\ i = 1,\dots,m_k,$$
and let $\zeta^{(k)} = \sum_{i=1}^{m_k}\zeta^{(i,k)}/m_k$; then $\mathbb{E}[\zeta^{(i,k)}] = 0$. Since $\phi_{\min}(D_{G_j,G_j})\ge1/(1+\varepsilon)$ for all $j\in S^c$ and $\|D^{-1/2}u\|_2\ge1/(1+\varepsilon)^{1/2}$ by (4.29), we further have
$$\begin{aligned}
\big|\zeta^{(i,k)}+1\big|
&\le \max_{j\in J_k}\frac{\|(D^{-1}u)_{G_j}\|_2\,\|Z^{(i,k)}\|_2}{\|D^{-1/2}u\|_2^2}
\le \max_{j\in J_k}\frac{\big\|D^{-1/2}_{G_j,G_j}\big\|_2\,\|(D^{-1/2}u)_{G_j}\|_2\,\big(\lambda_j^{-1}\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2\big)}{\|D^{-1/2}u\|_2^2} \\
&\le \frac{(1+\varepsilon)^{1/2}\|(D^{-1/2}u)_{G_j}\|_2\,\big(2^{k-1}\lambda_*\big)^{-1}\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2}{\|D^{-1/2}u\|_2^2}
\le \frac{(1+\varepsilon)^{1/2}}{2^{k-1}\lambda_*}\cdot\frac{\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2}{\|D^{-1/2}u\|_2}
\le \frac{(1+\varepsilon)\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2}{2^{k-1}\lambda_*}. \qquad (4.76)
\end{aligned}$$
Moreover,
$$\begin{aligned}
\mathbb{E}\big(\zeta^{(i,k)}\big)^2
&= \sum_{j\in J_k}\pi_{j,k}\Big(\frac{\pi_{j,k}^{-1}u_{G_j}^\top D^{-1}_{G_j,G_j}u_{G_j}}{\|D^{-1/2}u\|_2^2}\Big)^2 - \frac{\|(D^{-1/2}u)_{G_{J_k}}\|_2^4}{\|D^{-1/2}u\|_2^4}
\le \sum_{j\in J_k}\frac{\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2}{\lambda_j\|u_{G_j}\|_2}\cdot\frac{\|(D^{-1}u)_{G_j}\|_2\,\|u_{G_j}\|_2\,\|(D^{-1/2}u)_{G_j}\|_2^2}{\|D^{-1/2}u\|_2^4} \\
&\le \frac{\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2\,\max_{j\in J_k}\|(D^{-1}u)_{G_j}\|_2}{2^{k-1}\lambda_*\,\|D^{-1/2}u\|_2^2}\cdot\frac{\|(D^{-1/2}u)_{G_{J_k}}\|_2^2}{\|D^{-1/2}u\|_2^2}
\le \frac{\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2\,(1+\varepsilon)^{1/2}\|(D^{-1/2}u)_{G_j}\|_2}{2^{k-1}\lambda_*\,\|D^{-1/2}u\|_2^2}
\le \frac{(1+\varepsilon)\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2}{2^{k-1}\lambda_*}. \qquad (4.77)
\end{aligned}$$
Define $\xi_k = (1+\varepsilon)\sum_{j\in J_k}\lambda_j\|u_{G_j}\|_2/(2^{k-1}\lambda_*)$ and let $m_k$ be as in (4.26). Combining (4.76) and (4.77), by the Bernstein inequality,
$$\begin{aligned}
\mathbb{P}\Big\{ -\zeta^{(k)}\ge\frac{\sqrt{2}-1}{2\sqrt{2}}\,2^{-(k^*+1-k)/2}\varepsilon_1 \Big\}
&= \mathbb{P}\Big\{ -\frac{1}{m_k}\sum_{i=1}^{m_k}\zeta^{(i,k)}\ge\frac{\sqrt{2}-1}{2\sqrt{2}}\,2^{-(k^*+1-k)/2}\varepsilon_1\,\Big|\,\widetilde{X} \Big\} \\
&\le \exp\Bigg\{ -\frac{m_k\big(\frac{\sqrt{2}-1}{2\sqrt{2}}2^{-(k^*+1-k)/2}\varepsilon_1\big)^2/2}{\xi_k\big[1+\big(\frac{\sqrt{2}-1}{2\sqrt{2}}2^{-(k^*+1-k)/2}\varepsilon_1\big)/3\big]} \Bigg\}
= \exp\big\{-c\varepsilon_1^2L\big\}
\end{aligned}$$
holds for any $\varepsilon_1>0$ and $k = 1,\dots,k^*$, where $c$ is a certain constant. Further, we have
$$\begin{aligned}
\mathbb{P}\Big\{ \big\|D^{-1/2}Z^{(1)}\big\|_2^2-\big\|D^{-1/2}u\big\|_2^2\le-\varepsilon_1\big\|D^{-1/2}u\big\|_2^2 \Big\}
&= \mathbb{P}\Big\{ -\sum_{k=1}^{k^*}\zeta^{(k)}\ge\varepsilon_1/2 \Big\}
\le 1-\prod_{k=1}^{k^*}\mathbb{P}\Big\{ -\zeta^{(k)}\le\frac{\sqrt{2}-1}{2\sqrt{2}}\,2^{-(k^*+1-k)/2}\varepsilon_1 \Big\} \\
&\le 1-\prod_{k=1}^{k^*}\big[1-\exp\{-c\varepsilon_1^2L\}\big]
\le k^*\exp\{-c\varepsilon_1^2L\}. \qquad\square
\end{aligned}$$
Lemma 4.3. For any $A\subseteq\{1,\dots,p\}$ and $M>0$, $\big\{(\widetilde{X}_{i,*}u)^2\wedge M : \mathrm{supp}(u)\subseteq A\big\}$ is a VC class of functions of $\widetilde{X}_{i,*}$ with VC-dimension no greater than $10|A|+10$, $i = 1,\dots,n$.

Proof of Lemma 4.3. Denote the set of subgraphs of functions in $\{\widetilde{X}_{i,*}u : \mathrm{supp}(u)\subseteq A\}$ by $\mathcal{C}_1$ and the set of subgraphs of functions in $\{-\widetilde{X}_{i,*}u : \mathrm{supp}(u)\subseteq A\}$ by $\mathcal{C}_2$; then $V(\mathcal{C}_1) = V(\mathcal{C}_2) = |A|+2$ by Lemma 2.6.15 of van der Vaart and Wellner [78], where $V(\cdot)$ denotes the VC-dimension. Let $\mathcal{C}$ be the set of subgraphs of functions in $\{(\widetilde{X}_{i,*}u)^2\wedge M : \mathrm{supp}(u)\subseteq A\}$; then $\mathcal{C}\subseteq\mathcal{C}_1\cap\mathcal{C}_2$ and $V(\mathcal{C})\le V(\mathcal{C}_1\cap\mathcal{C}_2)$. Denote $N = 10|A|+10$. To prove Lemma 4.3, we only need to show $V(\mathcal{C}_1\cap\mathcal{C}_2)\le N$, or equivalently,
$$\max_{x_1,\dots,x_N}\Delta_N(\mathcal{C}_1\cap\mathcal{C}_2,x_1,\dots,x_N) < 2^N,$$
where
$$\Delta_N(\mathcal{C}_1\cap\mathcal{C}_2,x_1,\dots,x_N) = \#\big\{C\cap\{x_1,\dots,x_N\} : C\in\mathcal{C}_1\cap\mathcal{C}_2\big\}.$$
In fact, we have
$$\max_{x_1,\dots,x_N}\Delta_N(\mathcal{C}_1\cap\mathcal{C}_2,x_1,\dots,x_N)
\le \Big(\max_{x_1,\dots,x_N}\Delta_N(\mathcal{C}_1,x_1,\dots,x_N)\Big)\Big(\max_{x_1,\dots,x_N}\Delta_N(\mathcal{C}_2,x_1,\dots,x_N)\Big)
\le \Bigg(\sum_{j=0}^{V(\mathcal{C}_1)-1}\binom{N}{j}\Bigg)\Bigg(\sum_{j=0}^{V(\mathcal{C}_2)-1}\binom{N}{j}\Bigg)
= \Bigg(\sum_{j=0}^{|A|+1}\binom{N}{j}\Bigg)^2,$$
where the second inequality comes from Corollary 2.6.3 of van der Vaart and Wellner [78]. We need to prove $\big(\sum_{j=0}^{|A|+1}\binom{N}{j}\big)^2<2^N$. Let $\pi = (|A|+1)/N = 1/10$. Since
$$\sum_{j=0}^{|A|+1}\binom{N}{j}
= \sum_{j=0}^{|A|+1}\binom{N}{j}\pi^j(1-\pi)^{N-j}\,\frac{1}{\pi^j(1-\pi)^{N-j}}
\le \max_{j\le|A|+1}\frac{1}{\pi^j(1-\pi)^{N-j}}
= \frac{1}{\pi^{|A|+1}(1-\pi)^{N-|A|-1}}
= \Big(\frac{1}{\pi^\pi(1-\pi)^{1-\pi}}\Big)^N,$$
we only need to verify $\pi^\pi(1-\pi)^{1-\pi}>1/\sqrt{2}$ with $\pi = 1/10$, which holds numerically. $\square$
Lemma 4.4. Let $\varepsilon>0$ and let $s^*$ be as in Theorem 4.1. Suppose $\bar\Sigma_{G_j,G_j} = I_{d_j\times d_j}$ and $n\ge Cs^*\log\{ep/s^*\}$ for a certain constant $C$. If (4.36) is uniformly integrable, then
$$\mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n(\widetilde{X}_{i,*}\widetilde{v})^2 : \widetilde{v}\in\mathscr{C}_{(G)}\big((1+\varepsilon)\xi,S,\lambda\big),\ \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}^\top\bar\Sigma\widetilde{v} = 1 \Big\}\ge1-\varepsilon \Big\}\to1. \qquad (4.78)$$
If (4.38) is uniformly integrable, then
$$\mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n(\widetilde{X}_{i,*}\widetilde{v})^2 : \widetilde{v}\in\mathscr{C}^*_{(G)}\big((1+\varepsilon)\xi,S,\lambda\big),\ \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}^\top\bar\Sigma\widetilde{v} = 1 \Big\}\ge1-\varepsilon \Big\}\to1. \qquad (4.79)$$

Proof of Lemma 4.4. By Lemma 4.3, for any $A\subseteq\{1,\dots,p\}$ with $A\supseteq S$ and $|A\setminus S| = s^*$,
$$\big\{(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M : \mathrm{supp}(\widetilde{v})\subseteq A\big\}, \quad i = 1,\dots,n,$$
is a VC class of functions of $\widetilde{X}_{i,*}$ with VC-dimension no greater than $10|A|+10$. Denote this VC class by $\mathcal{C}$. Let $Q$ be any probability measure, $0<\varepsilon<1$, and let $N(\varepsilon,\mathcal{C},\|\cdot\|)$ be the covering number, i.e. the minimal number of balls $\{g:\|g-f\|<\varepsilon\}$ of radius $\varepsilon$ needed to cover $\mathcal{C}$. The norm considered here is the $L_2(Q)$ norm: $\|f\|_Q = \big(\int|f|^2\,dQ\big)^{1/2}$, for any $f\in\mathcal{C}$. By Theorem 2.6.4 of van der Vaart and Wellner [78], there exists a universal constant $C$ such that
$$\sup_Q N\big(\varepsilon,\mathcal{C},L_2(Q)\big) \le C\,V(\mathcal{C})(4e)^{V(\mathcal{C})}(1/\varepsilon)^{2V(\mathcal{C})-2}.$$
It then follows from Theorem 2.14.9 of van der Vaart and Wellner [78] that
$$\mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n\big[(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M-\mathbb{E}(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M\big] : \mathrm{supp}(\widetilde{v})\subseteq A \Big\}\le-\varepsilon \Big\}
\lesssim C^{V(\mathcal{C})}\exp\{-2n\varepsilon^2/M^2\}$$
with $C$ a positive constant. Since there are $\binom{p-s}{s^*}$ choices of $A$ in total,
$$\begin{aligned}
\mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n\big[(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M-\mathbb{E}(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M\big] : \|\widetilde{v}_{S^c}\|_0\le s^* \Big\}\le-\varepsilon \Big\}
&\lesssim \binom{p-s}{s^*}C^{V(\mathcal{C})}\exp\{-2n\varepsilon^2/M^2\} \\
&< \big(ep/s^*\big)^{s^*}C^{V(\mathcal{C})}\exp\{-2n\varepsilon^2/M^2\}
\lesssim \exp\big\{C_0s^*\log(ep/s^*)-2n\varepsilon^2/M^2\big\}
\end{aligned}$$
holds for a certain constant $C_0$. The second inequality holds because $V(\mathcal{C})\le10(s+s^*)+10$ and $\binom{a}{b}\le(ea/b)^b$ for any integers $a\ge b>0$. Then, if $n\ge Cs^*\log(ep/s^*)$ holds for a certain $C$,
$$\mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n\big[(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M-\mathbb{E}(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M\big] : \|\widetilde{v}_{S^c}\|_0\le s^* \Big\}\le-\varepsilon \Big\}\to0. \qquad (4.80)$$
Moreover, since (4.36) is uniformly integrable, there exists $M>0$ such that, for $\widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda)$ with $\|\widetilde{v}_{S^c}\|_0\le s^*$ and $\widetilde{v}^\top\bar\Sigma\widetilde{v} = 1$,
$$\mathbb{E}\Big\{ \frac{1}{n}\sum_{i=1}^n(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M \Big\}\ge1-\frac{\varepsilon}{2}. \qquad (4.81)$$
Combining (4.80) and (4.81), for $\xi' = (1+\varepsilon)\xi$ we have that
$$\begin{aligned}
&\mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n(\widetilde{X}_{i,*}\widetilde{v})^2 : \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \|\widetilde{v}_{S^c}\|_0\le s^*,\ \widetilde{v}^\top\bar\Sigma\widetilde{v} = 1 \Big\}\le1-\varepsilon \Big\} \\
&\quad\le \mathbb{P}\Big\{ \inf\Big\{ \frac{1}{n}\sum_{i=1}^n\big[(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M-\mathbb{E}(\widetilde{X}_{i,*}\widetilde{v})^2\wedge M\big] : \widetilde{v}\in\mathscr{C}_{(G)}(\xi',S,\lambda),\ \|\widetilde{v}_{S^c}\|_0\le s^* \Big\}\le-\frac{\varepsilon}{2} \Big\}
\end{aligned}$$
goes to zero. Therefore, (4.78) holds. Moreover, (4.79) can be proved in the same way. $\square$
4.8 Discussion
Although the bootstrap is beyond the scope of this chapter, we note that our results may provide a theoretical foundation for the bootstrapped penalized least-squares estimator. In bootstrapped estimation, one samples with replacement a set of size $n$, $\{(X^*_{i,*},y^*_i) : i = 1,\dots,n\}$, from the original data points $\{(\widetilde{X}_{i,*},y_i) : i = 1,\dots,n\}$ and forms the corresponding design matrix $X^*$ and response vector $y^*$. Let $\Sigma^* = \{X^*\}^\top X^*/n$ be the bootstrapped Gram matrix. Then it can be shown that the RE condition holds on $\Sigma^*$ under the low moment condition on random designs. In other words, the sample RE implies the RE of the bootstrapped sample, just as the population RE implies the sample RE. Then, prediction and coefficient estimation properties of the bootstrapped Lasso can be guaranteed under the low moment condition.
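To make the resampling step concrete, here is a minimal R sketch of forming a bootstrap resample and its Gram matrix; the object names `X` and `y` are illustrative placeholders for data already in memory, and the snippet is not code from the thesis.

```r
# Minimal sketch of forming a bootstrapped design, response and Gram matrix,
# assuming X (n x p) and y (length n) hold the observed data.
bootstrap_gram <- function(X, y) {
  n <- nrow(X)
  idx <- sample.int(n, size = n, replace = TRUE)  # resample rows with replacement
  X_star <- X[idx, , drop = FALSE]
  y_star <- y[idx]
  Sigma_star <- crossprod(X_star) / n             # bootstrapped Gram matrix X*'X*/n
  list(X = X_star, y = y_star, Sigma = Sigma_star)
}
```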
In Section 4.2, we argued that the RSC condition can be viewed as an RE-type condition with a slightly larger cone. Now we provide the detailed result and proof.

Proposition 4.1. The restricted strong convexity condition (4.13) is equivalent to the RE-type condition
$$u^\top\Sigma u \ge
\begin{cases}
\alpha\|u_A\|_2^2, & \|u\|_2\le1,\\
\alpha\|u_A\|_2, & \|u\|_2\ge1,
\end{cases} \qquad (4.82)$$
in the union of cones
$$\bigcup_{|A|\le Cn/\log p}\Big\{ u : \|u_{A^c}\|_1\le\xi\sqrt{|A|}\,\|u_A\|_2 \Big\} \qquad (4.83)$$
for certain positive constants $\alpha$ and $C$.

Remark 4.5. In the linear model, one only needs to check the RE for $\|u\|_2 = 1$; then (4.82) is equivalent to $\kappa^2(\Sigma;A,\xi)\ge\alpha$ for $u$ in the union of cones (4.83).
Proof of Proposition 4.1. RSC to RE: Let $\alpha_0>0$ and $0\le\tau_0\le1$ be fixed numbers satisfying
$$\sqrt{\tau_1/\alpha_1}\vee(\tau_2/\alpha_2) \le \nu\sqrt{\tau_0/\alpha_0}$$
for some $0<\nu<1$. Let $A$ be any set satisfying $(1+\xi)\sqrt{|A|(\tau_0/\alpha_0)(\log p)/n}\le1$. When $\|u_{A^c}\|_1\le\xi|A|^{1/2}\|u_A\|_2$, we have
$$\sqrt{(\tau_0/\alpha_0)(\log p)/n}\,\|u\|_1 \le (1+\xi)\sqrt{|A|(\tau_0/\alpha_0)(\log p)/n}\,\|u_A\|_2 \le \|u_A\|_2.$$
The RSC condition (4.13) then implies
$$u^\top\Sigma u \ge
\begin{cases}
(1-\nu)\alpha_1\|u_A\|_2^2, & \|u\|_2\le1,\\
(1-\nu)\alpha_2\|u_A\|_2, & \|u\|_2\ge1.
\end{cases}$$
Therefore, the RE condition (4.82) holds with $\alpha = (1-\nu)\min\{\alpha_1,\alpha_2\}$ in the union of cones
$$\bigcup_{(1+\xi)\sqrt{|A|(\tau_0/\alpha_0)(\log p)/n}\le1}\Big\{ u : \|u_{A^c}\|_1\le\xi\sqrt{|A|}\,\|u_A\|_2 \Big\}.$$

RE to RSC: Now suppose (4.82) holds with $|A| = k$, and let $\alpha,\tau$ be positive numbers satisfying
$$\frac{1}{\xi\sqrt{k}}+\frac{1}{\sqrt{4k}} \le \sqrt{(\tau/\alpha)(\log p)/n}.$$
We only need to consider $u$ satisfying
$$\sqrt{(\tau/\alpha)(\log p)/n}\,\|u\|_1 \le \|u\|_2,$$
since otherwise the RSC condition (4.13) holds automatically. Let $A_*$ be the index set of the $k$ largest elements of $|u|$. As
$$\|u_{A_*^c}\|_2 \le \|u\|_1/\sqrt{4k},$$
we have
$$\|u\|_1/(\xi\sqrt{k}) \le \sqrt{(\tau/\alpha)(\log p)/n}\,\|u\|_1 - \|u\|_1/\sqrt{4k} \le \|u\|_2 - \|u_{A_*^c}\|_2 \le \|u_{A_*}\|_2.$$
Thus $u$ is in the cone $\{u:\|u_{A_*^c}\|_1\le\xi\sqrt{|A_*|}\,\|u_{A_*}\|_2\}$. Then, for $\|u\|_2\le1$,
$$u^\top\Sigma u \ge \alpha\|u_{A_*}\|_2^2 \ge \alpha\|u_{A_*}\|_2^2+\alpha\|u_{A_*^c}\|_2^2-\alpha\|u\|_1^2/(4k) \ge \alpha\|u\|_2^2-\tau(\log p/n)\|u\|_1^2.$$
Similarly, for $\|u\|_2\ge1$,
$$u^\top\Sigma u \ge \alpha\|u_{A_*}\|_2 \ge \alpha\|u_{A_*}\|_2+\alpha\|u_{A_*^c}\|_2-\alpha\|u\|_1/\sqrt{4k} \ge \alpha\|u\|_2-\sqrt{\tau(\log p)/n}\,\|u\|_1.$$
Therefore, the RSC condition holds. $\square$
Chapter 5
Nonparametric Maximum Likelihood for Mixture Models: A
Convex Optimization Approach to Fitting Arbitrary
Multivariate Mixing Distributions
5.1 Introduction
Consider a setting where we have iid observations from a mixture model. More specifically, let $G_0$ be a probability distribution on $T\subseteq\mathbb{R}^d$ and let $\{F_0(\cdot\mid\theta)\}_{\theta\in T}$ be a family of probability distributions on $\mathbb{R}^n$ indexed by the parameter $\theta\in T$. Throughout the chapter, we assume that $T$ is closed and convex. Assume that $X_1,\dots,X_p\in\mathbb{R}^n$ are observed iid random variables and that $\Theta_1,\dots,\Theta_p\in\mathbb{R}^d$ are corresponding iid latent variables, which satisfy
$$X_j\mid\Theta_j\sim F_0(\cdot\mid\Theta_j) \quad\text{and}\quad \Theta_j\sim G_0. \qquad (5.1)$$
In (5.1), it may be the case that $F_0(\cdot\mid\theta)$ and $G_0$ are both known (pre-specified) distributions; more frequently, this is not the case. In this chapter, we will study problems where the mixing distribution $G_0$ is unknown, but will assume $F_0(\cdot\mid\theta)$ is known throughout.
Problems like this arise in applications throughout statistics, and various solutions have been
proposed. The distribution G0 can be modeled parametrically, which leads to hierarchical
modeling and parametric empirical Bayes methods [e.g. 18]. Another approach is to model
G0 as a discrete distribution supported on finitely- or infinitely-many points; this leads
to the study of finite mixture models or nonparametric Bayes, respectively [50, 23]. This
chapter focuses on another method for estimating G0: Nonparametric maximum likelihood.
Nonparametric maximum likelihood (NPML) methods for mixture models — and
closely related empirical Bayes methods — have been studied in statistics since the 1950s
[62, 38, 63]. They make virtually no assumptions on the mixing distribution $G_0$ and provide an elegant approach to problems like (5.1). The general strategy is to first find the nonparametric maximum likelihood estimator for $G_0$, denoted by $\widehat G$, and then perform inference via empirical Bayes [63, 18]; that is, inference in (5.1) is conducted via the posterior distribution $\Theta_j\mid X_j$, under the assumption $G_0 = \widehat G$. Research into NPMLEs for mixture models has included work on algorithms for computing NPMLEs and theoretical work on their statistical properties [e.g. 41, 8, 43, 26, 36]. However, implementing and analyzing NPMLEs for mixture models has historically been considered very challenging [e.g. p. 571 of 13, 17]. In this chapter, we study a computationally convenient approach involving approximate NPMLEs, which sidesteps many of these difficulties and is shown to be effective in a variety of applications.

Our approach is largely motivated by recent work initiated by [40] (in fact, Koenker & Mizera's work was itself partially inspired by relatively recent theoretical work on NPMLEs by [36]) and further pursued by others, including [28, 30, 29] and [15]. Koenker & Mizera
studied convex approximations to NPMLEs for mixture models in relatively large-scale
problems, with up to 100,000s of observations. In [40], they showed that for the Gaussian location model, where $X_j = \Theta_j+Z_j\in\mathbb{R}$ with $\Theta_j\sim G_0$ and $Z_j\sim N(0,1)$ independent, a good approximation to the NPMLE for $G_0$ can be accurately and rapidly computed using generic interior point methods.
[40]’s focus on convexity and scalability is one of the key concepts for this chapter. Here,
we show how a simple convex approximation to the NPMLE can be used effectively in a broad range of problems with nonparametric mixture models, including problems involving
(i) multivariate mixing distributions, (ii) discrete data, (iii) high-dimensional classification,
and (iv) state-space models. Backed by new theoretical and empirical results, we provide
concrete guidance for efficiently and reliably computing approximate multivariate NPMLEs.
Our main theoretical result (Proposition 5.1) suggests a simple procedure for finding the
support set of the estimated mixing distribution. Many of our empirical results highlight
the benefits of using multivariate mixing distributions with correlated components (Sections
5.6.2, 5.7, and 5.8), as opposed to univariate mixing distributions, which have been the
primary focus of previous research in this area (notable exceptions include theoretical work
on the Gaussian location-scale model in [26] and applications in [30, 29] involving estimation
problems with Gaussian models). In Sections 5.7–5.9, we illustrate the performance of the
methods described here in real-data applications involving baseball, cancer microarray data,
and online blood-glucose monitoring for diabetes patients. In comparison with other recent
work on NPMLEs for mixture models, this chapter distinguishes itself from [28] in that it
focuses on more practical aspects of fitting general multivariate NPMLEs. Additionally,
in this chapter we consider a substantially broader swath of applications than [30, 29] and
[15], where the focus is estimation in Gaussian models and classification with a univariate
NPMLE, respectively, and show that the same fundamental ideas may be effectively applied
in all of these settings.
5.2 NPMLEs for mixture models via convex optimization

5.2.1 NPMLEs

Let $\mathcal{G}_T$ denote the class of all probability distributions on $T\subseteq\mathbb{R}^d$ and suppose that $f_0(\cdot\mid\theta)$ is the probability density corresponding to $F_0(\cdot\mid\theta)$ (with respect to some given base measure). For $G\in\mathcal{G}_T$, the (negative) log-likelihood given the data $X_1,\dots,X_p$ is
$$\ell(G) = -\frac{1}{p}\sum_{j=1}^p\log\Big\{\int_T f_0(X_j\mid\theta)\,dG(\theta)\Big\}.$$
The Kiefer-Wolfowitz NPMLE for $G_0$ [38], denoted $\widehat G$, solves the optimization problem
$$\min_{G\in\mathcal{G}_T}\ell(G); \qquad (5.2)$$
in other words, $\ell(\widehat G) = \min_{G\in\mathcal{G}_T}\ell(G)$.
Solving (5.2) and studying properties of $\widehat G$ form the basis for essentially all of the existing research into NPMLEs for mixture models (including this chapter). Two important observations have had significant but somewhat countervailing effects on this research:

(i) The optimization problem (5.2) is convex;

(ii) If $f_0(X_j\mid\theta)$ and $T$ satisfy certain (relatively weak) regularity conditions, then $\widehat G$ exists and may be chosen so that it is a discrete measure supported on at most $p$ points.

The first observation above is obvious; the second summarizes Theorems 18-21 of [43]. Among the more significant regularity conditions mentioned in (ii) is that the set $\{f_0(X_j\mid\theta)\}_{\theta\in T}$ should be bounded for each $j = 1,\dots,p$.

Observation (i) leads to KKT-like conditions that characterize $\widehat G$ in terms of the gradient of $\ell$ and can be used to develop algorithms for solving (5.2) [e.g. 43]. While this approach is somewhat appealing, (5.2) is typically an infinite-dimensional optimization problem (whenever $T$ is infinite). Hence, there are infinitely many KKT conditions to check, which is generally impossible in practice.

On the other hand, observation (ii) reduces (5.2) to a finite-dimensional optimization problem. Indeed, (ii) implies that $\widehat G$ can be found by restricting attention in (5.2) to $G\in\mathcal{G}_p$, where $\mathcal{G}_p$ is the set of discrete probability measures supported on at most $p$ points in $T$. Thus, finding $\widehat G$ is reduced to fitting a finite mixture model with at most $p$ components. This is usually done with the EM-algorithm [41], where in practice one may restrict to $G\in\mathcal{G}_q$ for some $q<p$. However, while (ii) reduces (5.2) to a finite-dimensional problem, we have lost convexity:
$$\min_{G\in\mathcal{G}_q}\ell(G) \qquad (5.3)$$
is not a convex problem, because $\mathcal{G}_q$ is nonconvex. When $q$ is large (and recall that the theory suggests we should take $q = p$), well-known issues related to nonconvexity and finite mixture models become a significant obstacle [50].
5.2.2 A simple finite-dimensional convex approximation

In this chapter, we take a very simple approach to (approximately) solving (5.2), which maintains convexity and immediately reduces (5.2) to a finite-dimensional problem. Consider a pre-specified finite grid $\Lambda\subseteq T$. We study estimators $\widehat G_\Lambda$, which solve
$$\min_{G\in\mathcal{G}_\Lambda}\ell(G). \qquad (5.4)$$
The key difference between (5.3) and (5.4) is that $\mathcal{G}_\Lambda$, and hence (5.4), is convex, while $\mathcal{G}_q$ is nonconvex. Additionally, (5.4) is a finite-dimensional optimization problem, because $\Lambda$ is finite.

To derive a more convenient formulation of (5.4), suppose that
$$\Lambda = \{t_1,\dots,t_q\}\subseteq T \qquad (5.5)$$
and define the simplex $\Delta^{q-1} = \{w = (w_1,\dots,w_q)\in\mathbb{R}^q;\ w_l\ge0,\ w_1+\cdots+w_q = 1\}$. Additionally, let $\delta_t$ denote a point mass at $t\in\mathbb{R}^d$. Then there is a correspondence between $G = \sum_{k=1}^qw_k\delta_{t_k}\in\mathcal{G}_\Lambda$ and points $w = (w_1,\dots,w_q)\in\Delta^{q-1}$. It follows that (5.4) is equivalent to the optimization problem over the simplex,
$$\min_{w\in\Delta^{q-1}} -\frac{1}{p}\sum_{j=1}^p\log\Big\{\sum_{k=1}^qf_0(X_j\mid t_k)\,w_k\Big\}. \qquad (5.6)$$
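To make the finite-dimensional problem concrete, the sketch below evaluates the objective in (5.6) from a $p\times q$ likelihood matrix whose $(j,k)$ entry is $f_0(X_j\mid t_k)$. It is a minimal illustration only: the Gaussian location model is used purely as an example of $f_0$, and the object names are hypothetical.

```r
# Minimal sketch of the objective in (5.6), assuming a Gaussian location model
# f0(x | t) = dnorm(x, mean = t) purely for illustration.
# x:     observed data X_1, ..., X_p (univariate here for simplicity)
# grid:  the support points t_1, ..., t_q in Lambda
# w:     a point in the simplex (weights of the discrete mixing distribution)
neg_loglik <- function(w, x, grid) {
  L <- outer(x, grid, function(xx, tt) dnorm(xx, mean = tt))  # p x q matrix f0(X_j | t_k)
  -mean(log(L %*% w))                                         # -(1/p) sum_j log sum_k L[j,k] w_k
}

# Example usage with uniform starting weights:
# x <- rnorm(500, mean = sample(c(0, 5), 500, replace = TRUE))
# grid <- seq(min(x), max(x), length.out = 100)
# neg_loglik(rep(1 / length(grid), length(grid)), x, grid)
```

Minimizing this convex function over the simplex is exactly the problem solved by the algorithms discussed in Section 5.5.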
Researchers studying NPMLEs have previously considered estimators like $\widehat G_\Lambda$, which solve (5.4)-(5.6). However, most have focused on relatively simple models with univariate mixing distributions $G_0$ [8, 36, 40]. In very recent work, [29, 30] have considered multivariate NPMLEs for estimation problems involving Gaussian models; by contrast, our aim is to formulate strategies for solving and implementing the general problem, as specified in (5.2), (5.4), and (5.6).
5.3 Choosing Λ
The approximate NPMLE $\widehat G_\Lambda$ is the estimator we use throughout the rest of the chapter. One remaining question is: how should $\Lambda$ be chosen? Our perspective is that $\widehat G_\Lambda$ is an approximation to $\widehat G$ and its performance characteristics are inherited from $\widehat G$. In general, $\widehat G_\Lambda\ne\widehat G$. However, as one selects larger and larger finite grids $\Lambda\subseteq T$, which are more and more dense in $T$, evidently $\widehat G_\Lambda\to\widehat G$. Thus, heuristically, as long as the grid $\Lambda$ is "dense enough" in $T$, $\widehat G_\Lambda$ should perform similarly to $\widehat G$.

If $T$ is compact, then any regular grid $\Lambda\subseteq T$ is finite and implementing (5.4) is straightforward (specific implementations are discussed in Section 5.5). Thus, for compact $T$, one can choose $\Lambda$ to be a regular grid with as many points as are computationally feasible. For general $T$, we propose a two-step approach to choosing $\Lambda$: (i) find a compact convex subset $T_0\subseteq T$ so that (5.2) is equivalent (or approximately equivalent) to
$$\inf_{G\in\mathcal{G}_{T_0}}\ell(G); \qquad (5.7)$$
(ii) choose $\Lambda\subseteq T_0\subseteq T$ to be a regular grid with $q$ points, for some sufficiently large $q$. Empirical results seem to be fairly insensitive to the choice of $q$. In Sections 5.7-5.9, we choose $q = 30^d$ for models with $d = 2,3$ dimensional mixing distributions $G$. For some simple models with univariate $G$ ($d = 1$), theoretical results suggest that if $q = \sqrt{p}$, then $\widehat G_\Lambda$ is statistically indistinguishable from $\widehat G$ [15].
For each $j = 1,\dots,p$, define
$$\widehat\theta_j = \widehat\theta(X_j) = \arg\max_{\theta\in T}f_0(X_j\mid\theta)$$
to be the maximum likelihood estimator (MLE) for $\Theta_j$, given the data $X_j\in\mathbb{R}^n$. The following proposition implies that (5.2) and (5.7) are equivalent when the likelihoods $f_0(X_j\mid\theta)$ are from a class of elliptical unimodal distributions and $T_0 = \mathrm{conv}(\widehat\theta_1,\dots,\widehat\theta_p)$ is the convex hull of $\widehat\theta_1,\dots,\widehat\theta_p$. This result enables us to employ the strategy described above for choosing $\Lambda$ and finding $\widehat G_\Lambda$; specifically, we take $\Lambda$ to be a regular grid contained in the compact convex set $\mathrm{conv}(\widehat\theta_1,\dots,\widehat\theta_p)$.
Proposition 5.1. Suppose that $f_0$ has the form
$$f_0(X_j\mid\theta) = h\big\{(\widehat\theta_j-\theta)^\top\Sigma^{-1}(\widehat\theta_j-\theta)\big\}\,u(X_j), \qquad (5.8)$$
where $h:[0,\infty)\to[0,\infty)$ is a decreasing function, $\Sigma$ is a $p\times p$ positive definite matrix, and $u:\mathbb{R}^n\to\mathbb{R}$ is some other function that does not depend on $\theta$. Let $T_0 = \mathrm{conv}(\widehat\theta_1,\dots,\widehat\theta_p)$. Then $\ell(\widehat G) = \inf_{G\in\mathcal{G}_{T_0}}\ell(G)$.

Proof. Assume that $G = \sum_{k=1}^qw_k\delta_{t_k}$, where $t_1,\dots,t_q\in T$ and $w_1,\dots,w_q>0$. Further assume that $t_q\notin T_0 = \mathrm{conv}(\widehat\theta_1,\dots,\widehat\theta_p)$. We show that there is another probability distribution $\bar G = \sum_{k=1}^{q-1}w_k\delta_{t_k}+w_q\delta_{\bar t_q}$, with $\bar t_q\in T_0$, satisfying $\ell(\bar G)\le\ell(G)$. This suffices to prove the proposition.

Let $\bar t_q$ be the projection of $t_q$ onto $T_0$ with respect to the inner product $(s,t)\mapsto s^\top\Sigma^{-1}t$. To prove that $\ell(\bar G)\le\ell(G)$, we show that $f_0(X_j\mid\bar t_q)\ge f_0(X_j\mid t_q)$ for each $j = 1,\dots,p$. We have
$$\begin{aligned}
(\widehat\theta_j-t_q)^\top\Sigma^{-1}(\widehat\theta_j-t_q)
&= (\widehat\theta_j-\bar t_q+\bar t_q-t_q)^\top\Sigma^{-1}(\widehat\theta_j-\bar t_q+\bar t_q-t_q) \\
&= (\widehat\theta_j-\bar t_q)^\top\Sigma^{-1}(\widehat\theta_j-\bar t_q) + 2(\bar t_q-t_q)^\top\Sigma^{-1}(\widehat\theta_j-\bar t_q) + (\bar t_q-t_q)^\top\Sigma^{-1}(\bar t_q-t_q) \\
&\ge (\widehat\theta_j-\bar t_q)^\top\Sigma^{-1}(\widehat\theta_j-\bar t_q),
\end{aligned}$$
where we have used the fact that $(\bar t_q-t_q)^\top\Sigma^{-1}(\widehat\theta_j-\bar t_q)\ge0$, because $\bar t_q$ is the projection of $t_q$ onto the convex set $T_0$. By (5.8), it follows that $f_0(X_j\mid\bar t_q)\ge f_0(X_j\mid t_q)$, as was to be shown.
The condition (5.8) is rather restrictive, but we believe it applies in a number of important problems. The fundamental example where (5.8) holds is $X_j\mid\Theta_j\sim N(\Theta_j,\Sigma)$; in this case $\widehat\theta_j = X_j$ and (5.8) holds with $u$ a certain constant and $h(z)\propto e^{-z/2}$. Condition (5.8) also holds in elliptical models, where $\Theta_j$ is the location parameter of $X_j\mid\Theta_j$. More broadly, if $X_j = (X_{1j},\dots,X_{nj})\in\mathbb{R}^n$ may be viewed as a vector of replicates $X_{ij}$, $i = 1,\dots,n$, drawn from some common distribution conditional on $\Theta_j$, then standard results suggest that the MLEs $\widehat\theta_j$ may be approximately Gaussian if $n$ is sufficiently large, and (5.8) may be approximately valid. Specific applications where a normal approximation argument for $\widehat\theta_j$ may imply that (5.8) is approximately valid include count data (similar to Section 5.7) and time series modeling (Section 5.9).
5.4 Connections with finite mixtures
Finding $\widehat G_\Lambda$ is equivalent to fitting a finite-mixture model, where the locations of the atoms for the mixing measure have been pre-specified (specifically, the atoms are taken to be the points in $\Lambda$). Thus, the approach in this chapter reduces computations for the relatively complex nonparametric mixture model (5.1) to a convex optimization problem that is substantially simpler than fitting a standard finite mixture model (generally a non-convex problem).
An important distinction of nonparametric mixture models is that they lack the built-in interpretability of the components/atoms of finite mixture models, and are less suited for clustering applications. On the other hand, taking the nonparametric approach provides additional flexibility for modeling heterogeneity in applications where it is not clear that there should be well-defined clusters. Moreover, post hoc clustering and finite mixture model methods could still be used after fitting an NPMLE; this might be advisable if, for instance, $\widehat G_\Lambda$ has several clearly-defined modes.
5.5 Implementation overview
A variety of well-known algorithms are available for solving (5.6) and finding $\widehat G_\Lambda$. We've
experimented with several, including the EM-algorithm, interior point methods, and the
Frank-Wolfe algorithm. This section contains a brief overview of how we’ve implemented
these algorithms; numerical results comparing the algorithms are contained in the following
section.
One of the early applications of the EM-algorithm is mixture models [41]. Solving (5.6)
with the EM-algorithm for mixture models is especially simple, because the problem is
convex (recall that the finite mixture model problem — as opposed to the nonparametric
problem — is typically non-convex). [40] have developed interior point methods for solving
(5.6). Along with [39], they created an R package REBayes that solves (5.6) for a handful
of specific nonparametric mixture models, e.g. Gaussian mixtures and univariate Poisson
mixtures; the REBayes package calls an external optimization software package, Mosek,
and relies on Mosek’s built-in interior point algorithms. In our numerical analyses, we
used REBayes to compute some one-dimensional NPMLEs with interior point methods.
To estimate multi-dimensional NPMLEs with interior point methods, we used our own
implementations based on another R package Rmosek [3] (REBayes does appear to have
some built-in functions for estimating two-dimensional NPMLEs, but we found them to be
somewhat unstable in our applications). We note that our interior point implementation
solves the primal problem (5.6), while REBayes solves the dual. The Frank-Wolfe algorithm
[24] is a classical algorithm for constrained convex optimization problems, which has recently
been the subject of renewed attention [e.g. 35]. Our implementation of the Frank-Wolfe
algorithm closely resembles the “vertex direction method,” which has previously been used
for finding the NPMLE in nonparametric mixture models [7].
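For concreteness, a minimal sketch of the EM iteration for the fixed-grid problem (5.6) is given below. It assumes a precomputed $p\times q$ likelihood matrix `L` with entries $f_0(X_j\mid t_k)$ (e.g. as built in the sketch in Section 5.2.2); the function name, convergence settings, and interfaces are illustrative assumptions, not taken from any package.

```r
# Minimal EM sketch for the fixed-grid NPMLE (5.6), assuming L is the p x q
# matrix with L[j, k] = f0(X_j | t_k).  Returns the estimated weights w.
npmle_em <- function(L, max_iter = 1000, tol = 1e-8) {
  p <- nrow(L); q <- ncol(L)
  w <- rep(1 / q, q)                               # start from uniform weights
  for (it in seq_len(max_iter)) {
    dens <- as.vector(L %*% w)                     # mixture density at each X_j
    post <- (L * rep(w, each = p)) / dens          # E-step: P(t_k | X_j), p x q
    w_new <- colMeans(post)                        # M-step: average responsibilities
    if (sum(abs(w_new - w)) < tol) return(w_new)
    w <- w_new
  }
  w
}
```

Because the grid is fixed, each iteration only updates the weights, which is what makes the EM approach to (5.6) so simple relative to fitting a free-knot finite mixture.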
All of the algorithms used in this chapter were implemented in R. We did not attempt
to heavily optimize any of these implementations; instead, our main objective was to
demonstrate that there are a range of simple and effective methods for finding (approximate)
NPMLEs. While the REBayes and Rmosek packages were used for their interior point
methods, no packages beyond base R were required for any of our other implementations.
5.6 Simulation studies
This section contains simulation results for NPMLEs and a Gaussian location-scale mixture
model. Section 5.6.1 contains a comparison of the various NPMLE algorithms described
in the previous section. In Section 5.6.2, we compare the performance of NPMLE-based
estimators to other commonly used methods for estimating the mean in a Gaussian location-
scale model.
In all of the simulations described in this section, we generated the data as follows. For $j = 1,\dots,p$, we generated independent $\Theta_j = (\mu_j,\sigma_j)\sim G_0$ and corresponding observations $X_j\in\mathbb{R}^n$. Each $X_j = (X_{1j},\dots,X_{nj})^\top$ was a vector of $n$ replicates
$$X_{1j},\dots,X_{nj}\mid\Theta_j\sim N(\mu_j,\sigma_j^2) \qquad (5.9)$$
that were generated independently, conditional on $\Theta_j$. In the general model (5.9), the mixing distribution $G_0$ is bivariate. However, we considered two values of $G_0$ in our simulations: one where the marginal distribution of $\sigma_j$ was degenerate (i.e. $\sigma_j$ was constant, so $G_0$ was effectively univariate) and one where the marginal distribution of $\sigma_j$ was non-degenerate. (Note that for non-degenerate $\sigma_j$, $n\ge2$ replicates are essential in order to ensure that the likelihood is bounded and that $\widehat G$ exists.)

Throughout the simulations, we took $p = 1000$ and $n = 16$. For the first mixing distribution $G_0$ (degenerate $\sigma_j$), we fixed $\sigma_j = 4$ and took $\mu_j$ so that $\mathbb{P}(\mu_j = 0) = \mathbb{P}(\mu_j = 5) = 1/2$. For the second mixing distribution (non-degenerate $\sigma_j$), we took $\mathbb{P}(\mu_j = 0,\sigma_j = 5) = \mathbb{P}(\mu_j = 5,\sigma_j = 3) = 0.5$; for this distribution, $\mu_j$ and $\sigma_j$ are correlated.
5.6.1 Comparing NPMLE algorithms

For each mixing distribution, we computed $\widehat G_\Lambda$ using the algorithms described in Section 5.5: the EM-algorithm, an interior point method with Rmosek, and the Frank-Wolfe algorithm. For each of these algorithms, we also computed $\widehat G_\Lambda$ for various grids $\Lambda$. Specifically, we considered regular grids $\Lambda = \{m_k\}_{k=1}^{q_1}\times\{s_k\}_{k=1}^{q_2}\subseteq[\min_j\widehat\mu_j,\max_j\widehat\mu_j]\times[\min_j\widehat\sigma_j,\max_j\widehat\sigma_j]\subseteq\mathbb{R}^2$, where $\widehat\mu_j = n^{-1}\sum_iX_{ij}$ and $\widehat\sigma_j^2 = n^{-1}\sum_i(X_{ij}-\widehat\mu_j)^2$. The values $q_1,q_2$ determine the number of grid points in $\Lambda$ for $\mu_j,\sigma_j$, respectively, and in the simulations we fit estimators with $(q_1,q_2) = (30,30),(50,50)$, and $(100,100)$.
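The following is a minimal R sketch of this grid-construction step, included only as an illustration (object names are hypothetical; this is not the code used for the reported simulations).

```r
# Minimal sketch of the regular bivariate grid Lambda over the ranges of the
# per-column MLEs, assuming X is the n x p data matrix of replicates.
make_grid <- function(X, q1 = 30, q2 = 30) {
  mu_hat    <- colMeans(X)                               # per-column sample means
  sigma_hat <- sqrt(colMeans(sweep(X, 2, mu_hat)^2))     # per-column sample sds
  expand.grid(mu    = seq(min(mu_hat),    max(mu_hat),    length.out = q1),
              sigma = seq(min(sigma_hat), max(sigma_hat), length.out = q2))
}
```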
In addition to fitting the two-dimensional NPMLEs $\widehat G_\Lambda$ described above, for the simulations with degenerate $\sigma_j$ we also fit one-dimensional NPMLEs to the data $\widehat\mu_1,\dots,\widehat\mu_p$, according to the model
$$\widehat\mu_j\mid\mu_j\sim N(\mu_j,1).$$
We fit these one-dimensional NPMLEs using all of the same algorithms as for the two-dimensional NPMLEs (EM, interior point with Rmosek, and Frank-Wolfe), and we also used the REBayes interior point implementation to estimate the distribution of $\mu_j$ in this setting. For the one-dimensional NPMLEs, we took $\Lambda\subseteq[\min_j\widehat\mu_j,\max_j\widehat\mu_j]\subseteq\mathbb{R}$ to be the regular grid with $q = 300$ points. This allows us to compare the performance of methods for one- and two-dimensional NPMLEs (where the one-dimensional NPMLEs take the distribution of $\sigma_j$ to be known) and to compare the performance of the two interior point algorithms, among other things.
For each simulated dataset and estimator $\widehat G_\Lambda$, we recorded several metrics. First, we computed the total squared error (TSE),
$$\mathrm{TSE} = \sum_{j=1}^p\big\{\mathbb{E}_{\widehat G_\Lambda}(\mu_j\mid X_j)-\mu_j\big\}^2.$$
Second, we computed the difference between the log-likelihood of $\widehat G_\Lambda$ and the log-likelihood of $\widehat G_{\mathrm{EM}}$, the corresponding estimator for $G_0$ based on the EM-algorithm:
$$\Delta(\text{log-lik.}) = \ell(\widehat G_{\mathrm{EM}})-\ell(\widehat G_\Lambda).$$
Note that $\Delta(\text{log-lik.})>0$ if $\widehat G_\Lambda$ has a smaller negative log-likelihood than $\widehat G_{\mathrm{EM}}$ (we are taking the EM-estimator $\widehat G_{\mathrm{EM}}$ as a baseline for measuring the log-likelihood). Finally, we recorded the time required to compute $\widehat G_\Lambda$ (in seconds; all calculations were performed on a 2015 MacBook Pro laptop). Summary statistics are reported in Table 5.1.
It is evident that the results in Table 5.1 are relatively insensitive to the number of grid points $(q_1,q_2)$ chosen for the two-dimensional NPMLE implementations. In terms of TSE, the EM algorithm and interior point methods perform very similarly across all of the settings, while the interior point methods appear to slightly out-perform the EM algorithm in terms of $\Delta(\text{log-lik.})$ across the board. Additionally, the interior point methods have smaller compute time than the EM algorithm, though the difference is not too significant for applications at this scale (for mixing distribution 1, with degenerate $\sigma_j$, the REBayes dual implementation appears to be somewhat faster than our Rmosek primal implementation). The Frank-Wolfe algorithm is the fastest implementation we have considered, but its performance in terms of TSE and $\Delta(\text{log-lik.})$ is considerably worse than that of the EM algorithm or interior point methods. In the remainder of the chapter, we use the EM algorithm exclusively for computing NPMLEs; we believe it strikes a balance between simplicity and performance.
5.6.2 Gaussian location scale mixtures: Other methods for estimating a
normal mean vector
Beyond NPMLEs, we also implemented several other methods that are commonly used for estimating the mean vector $\mu = (\mu_1,\dots,\mu_p)^\top\in\mathbb{R}^p$ in Gaussian location-scale models and computed the corresponding TSE. Specifically, we considered the fixed-$\Theta_j$ MLE, $\widehat\mu = (\widehat\mu_1,\dots,\widehat\mu_p)^\top\in\mathbb{R}^p$; the James-Stein estimator; the heteroscedastic SURE estimator of Xie, Kou, and Brown [81]; and a soft-thresholding estimator. The James-Stein estimator is a classical shrinkage estimator for the Gaussian location model. The version employed here is described in [81] and is designed for heteroscedastic data. The heteroscedastic SURE estimator is another shrinkage estimator, which was designed to ensure that features with a high noise variance are "shrunk" more than those with a low noise variance. Both the James-Stein estimator and the heteroscedastic SURE estimator depend on the values $\sigma_j$. The soft-thresholding estimator takes the form $\widehat\mu(X)_j = s_t(\widehat\mu_j)$, where $t\ge0$ is a constant and $s_t(x) = \mathrm{sign}(x)\max\{|x|-t,0\}$, $x\in\mathbb{R}$. For the soft-thresholding estimator, $t$ was chosen to minimize the TSE. Observe that the James-Stein, SURE, and soft-thresholding estimators all depend on information that is typically not available in practice: the values $\sigma_j$ and the actual TSE. By contrast, the two-dimensional NPMLEs described in the previous subsection utilize only the observed data $X_1,\dots,X_p$.
In Table 5.2, we report the TSE for the different estimators described in this section, along with the TSE for the bivariate NPMLE fit using the EM algorithm. We also fit a univariate NPMLE in this example, where $\sigma_j$ was not assumed to be known; instead, we used the plug-in estimate $\widehat\sigma_j$ in place of $\sigma_j$ and then computed the NPMLE for the distribution of $\mu_j$.
Table 5.2 shows that the NPMLEs dramatically out-perform the alternative estimators
in this setting, in terms of TSE. The bivariate NPMLE out-performs the univariate NPMLE
under both mixing distributions 1 and 2, but its advantage is especially pronounced under
mixing distribution 2, where µj and σj are correlated. This highlights the potential
advantages of bivariate NPMLEs over univariate approaches in settings with multiple
parameters.
5.7 Baseball data
Baseball data is a well-established testing ground for empirical Bayes methods [19]. The
baseball dataset we analyzed contains the number of at-bats and hits for all of the Major
League Baseball players during the 2005 season and has been previously analyzed in a
number of papers [10, 37, 53, 81]. The goal of the analysis is to use the data from the first
half of the season to predict each player’s batting average (hits/at-bats) during the second
half of the season. Overall, there are 929 players in the baseball dataset; however, following
[10] and others, we restrict attention to the 567 players with more than 10 at-bats during
the first half of the season (we follow the other preprocessing steps described in [10] as well).
Let $A_j$ and $H_j$ denote the number of at-bats and hits, respectively, for player $j$ during the first half of the season. We assume that $(A_j,H_j)$ follows a Poisson-binomial mixture model, where $A_j\mid(\lambda_j,\pi_j)\sim\mathrm{Poisson}(\lambda_j)$, $H_j\mid(A_j,\lambda_j,\pi_j)\sim\mathrm{binomial}(A_j,\pi_j)$, and $(\lambda_j,\pi_j)\sim G_0$. This model has a bivariate mixing distribution $G_0$, i.e. $d = 2$. In the notation of (5.1), $X_j = (A_j,H_j)$ and $\Theta_j = (\lambda_j,\pi_j)$. We propose to estimate each player's batting average for the second half of the season by the posterior mean of $\pi_j$, computed under $(\lambda,\pi)\sim\widehat G_\Lambda$:
$$\widehat\pi_j = \mathbb{E}_{\widehat G_\Lambda}(\pi_j\mid A_j,H_j). \qquad (5.10)$$
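As an illustration of how (5.10) can be computed once the grid weights are available, a minimal sketch is given below; the bivariate grid `grid` (columns `lambda` and `pi`) and the weight vector `w` are the hypothetical objects from the earlier sketches, not code from the thesis.

```r
# Minimal sketch of the posterior mean (5.10) under the Poisson-binomial model,
# assuming `grid` is a data frame with columns lambda and pi (support points),
# `w` are the fitted NPMLE weights, and A, H are vectors of at-bats and hits.
posterior_mean_pi <- function(A, H, grid, w) {
  p <- length(A); q <- nrow(grid)
  L <- matrix(0, p, q)
  for (k in seq_len(q)) {                       # likelihood f0((A_j, H_j) | lambda_k, pi_k)
    L[, k] <- dpois(A, lambda = grid$lambda[k]) * dbinom(H, size = A, prob = grid$pi[k])
  }
  post <- (L * rep(w, each = p)) / as.vector(L %*% w)   # posterior over grid points
  as.vector(post %*% grid$pi)                           # E[pi_j | A_j, H_j]
}
```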
Most previously published analyses of the baseball data begin by transforming the data via the variance-stabilizing transformation
$$W_j = \arcsin\sqrt{\frac{H_j+1/4}{A_j+1/2}} \qquad (5.11)$$
(Muralidharan [53] is a notable exception). Under this transformation, $W_j$ is approximately distributed as $N\{\mu_j,(4A_j)^{-1}\}$, where $\mu_j = \arcsin\sqrt{\pi_j}$. Methods for Gaussian observations may be applied to the transformed data, with the objective of estimating $\mu_j$. Following this approach, a variety of methods based on shrinkage, the James-Stein estimator, and parametric empirical Bayes methods for Gaussian data have been proposed and studied [10, 37, 81].

Under the transformation (5.11), it is standard to use total squared error to measure the performance of estimators $\widehat\mu_j$ [e.g. 10]. In this example, the total squared error is defined as
$$\mathrm{TSE} = \sum_j\Big\{(\widehat\mu_j-\widetilde W_j)^2-\frac{1}{4\widetilde N_j}\Big\},$$
where
$$\widetilde W_j = \arcsin\sqrt{\frac{\widetilde H_j+1/4}{\widetilde N_j+1/2}},$$
and $\widetilde N_j$ and $\widetilde H_j$ denote the at-bats and hits from the second half of the season, respectively. For ease of comparison, we used TSE to measure the performance of our estimates $\widehat\pi_j$, after applying the transformation $\widehat\mu_j = \arcsin\sqrt{\widehat\pi_j}$.
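For completeness, the transformation and the TSE computation translate directly into a couple of lines of R; this is a minimal illustration with hypothetical variable names.

```r
# Minimal sketch of the variance-stabilizing transform (5.11) and the TSE,
# assuming second-half counts (H2, A2) and estimates mu_hat on the arcsin scale.
vs_transform <- function(hits, at_bats) asin(sqrt((hits + 0.25) / (at_bats + 0.5)))

tse <- function(mu_hat, H2, A2) {
  W2 <- vs_transform(H2, A2)
  sum((mu_hat - W2)^2 - 1 / (4 * A2))
}
```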
Results from the baseball analysis are reported in Table 5.3. Following the work of others, we have analyzed all players from the dataset together, and then the pitchers and non-pitchers from the dataset separately. In addition to our Poisson-binomial NPMLE-based estimator (5.10), we considered six other previously studied estimators:
1. The (fixed-parameter) MLE estimator $\widehat\mu_j = W_j$ uses each player's hits and at-bats from the first half of the season to estimate their performance in the second half.

2. The grand mean $\widehat\mu_j = p^{-1}(W_1+\cdots+W_p)$ gives the exact same estimate for each player's performance in the second half of the season, which is equal to the average performance of all players during the first half.

3. The James-Stein parametric empirical Bayes estimator described in [10].

4. The weighted generalized MLE (weighted GMLE), which uses at-bats as a covariate [37]. This is essentially a univariate NPMLE method for Gaussian models with covariates.

5. The semiparametric SURE estimator is a flexible shrinkage estimator that may be viewed as a generalization of the James-Stein estimator [81].

6. The binomial mixture method in [53] is another empirical Bayes method, which does not require the data to be transformed and estimates $\pi_j$ directly (in [53], they work conditionally on the at-bats $A_j$). TSE is computed after applying the $\arcsin\sqrt{\cdot}$ transformation.
The values reported in Table 5.3 are the TSEs of each estimator, relative to the TSE of the fixed-parameter MLE. Our Poisson-binomial method performs very well, recording the minimum TSE when all of the data (pitchers and non-pitchers) are analyzed together and for the non-pitchers. Moreover, the Poisson-binomial NPMLE GΛ works on the original scale of the data (no normal transformation is required) and may be useful for other purposes, beyond just estimation/prediction. Figure 5.1 (a) contains a histogram of 20,000 independent draws from the estimated distribution of (Aj, Hj/Aj), fitted with the Poisson-binomial NPMLE
to all players in the baseball dataset. Observe that the distribution appears to be bimodal.
By comparing this histogram with histograms of the observed data from the non-pitchers
and pitchers separately (Figure 5.1 (b)–(c)), it appears that the mode at the left of Figure
5.1 (a) represents a group of players that includes the pitchers and the mode at the right represents the bulk of the non-pitchers.

Figure 5.1: (a) Histogram of 20,000 independent draws from the estimated distribution of (Aj, Hj/Aj), fitted with the Poisson-binomial NPMLE to all players in the baseball dataset; (b) histogram of non-pitcher data from the baseball dataset; (c) histogram of pitcher data from the baseball dataset.
5.8 Two-dimensional NPMLE for cancer microarray classification
Dicker and Zhao [15] proposed a univariate NPMLE-based method for high-dimensional classification problems and studied applications involving cancer microarray data. The classifiers from [15] are based on a Gaussian model with one-dimensional mixing distributions, i.e. d = 1. In this section we show that using a bivariate mixing distribution may substantially improve performance.
Two datasets from the Microarray Quality Control Phase II project [49] are considered;
one from a breast cancer study and one from a myeloma study. The training dataset for
the breast cancer study contains n = 130 subjects and p = 22283 probesets (genes); the test dataset contains 100 subjects. The training dataset for the myeloma study contains n = 340 subjects and p = 54675 probesets; the test dataset contains 214 subjects. The goal
is to use the training data to build binary classifiers for several outcomes, then check the
performance of these classifiers on the test data. Outcomes for the breast cancer data are
response to treatment (“Response”) and estrogen receptor status (“ER status”); outcomes
for the myeloma data are overall and event-free survival (“OS” and “EFS”).
For each of the studies, let Xij denote the expression level of gene j in subject i and let Yi ∈ {0, 1} be the class label for subject i. Let Xj = (X1j, . . . , Xnj)^T ∈ R^n. We assume that each class (k = 0, 1) and each gene (j = 1, . . . , p) has an associated mean expression level µjk ∈ R, and that conditional on the Yi and µjk all of the Xij are independent and Gaussian, satisfying Xij | (Yi = k, µjk) ~ N(µjk, 1) (the gene-expression levels in the datasets are all standardized to have variance 1).

In [15], they assume that µ1k, . . . , µpk ~ Gk (k = 0, 1) are all independent draws from two distributions, G0 and G1. They use the training data from classes k = 0, 1 to separately estimate the distributions G0 and G1 using NPMLEs, and then implement the Bayes classifier, replacing G0 and G1 with the corresponding estimates. In this chapter, we model Θj = (µj0, µj1) ~ G0 jointly, then compute the bivariate NPMLE GΛ, and finally use GΛ in place of G0 in the Bayes classifier for this model. The model from [15] is equivalent to the model proposed here when µj0 and µj1 are independent. Results from analyzing the MAQC datasets using these two classifiers (the previously proposed method with 1-dimensional NPMLEs and the 2-dimensional NPMLE described here), along with some
other well-known and relevant classifiers, may be found in Table 5.4. The other classifiers
we considered were:
1. NP EBayes w/smoothing. Another nonparametric empirical Bayes classifier proposed
in [27], which uses nonparametric smoothing to fit a univariate density to the µj and
then employs a version of linear discriminant analysis.
2. Regularized LDA. A version of `1-regularized linear discriminant analysis, proposed in
Mai, Zou, and Yuan [48].
3. Logistic lasso. `1-penalized logistic regression fit using the R package glmnet.
For each of the datasets and outcomes, the 2-dimensional NPMLE classifier substantially
outperforms the 1-dimensional NPMLE, and is very competitive with the top performing
classifiers. Modeling dependence between µj0 and µj1, as with the 2-dimensional NPMLE,
seems sensible because most of the genes are likely to have similar expression levels across
classes, i.e. µj0 and µj1 are likely to be correlated. This may be interpreted as a kind of sparsity assumption on the data; such assumptions are prevalent in high-dimensional classification problems. Moreover, our proposed method involving NPMLEs should adapt to non-sparse
settings as well, since G0 is allowed to be an arbitrary bivariate distribution.
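To illustrate how such a classifier can be computed from a fitted discrete bivariate mixing distribution, the following R sketch implements a naive-Bayes-style discriminant over genes. It is a simplified stand-in for, not the exact form of, the classifiers of [15] and of this section; the function and argument names are ours. Here xbar0 and xbar1 are per-gene training class means, n0 and n1 the class training sample sizes, and (u0, u1, w) the support points and weights of the fitted bivariate mixing distribution.

## Sketch: classify a test subject with expression vector xnew using a discrete
## bivariate prior on (mu_j0, mu_j1) and a naive-Bayes product over genes.
npmle_classify <- function(xnew, xbar0, xbar1, n0, n1, u0, u1, w, prior1 = 0.5) {
  p <- length(xnew); score <- 0
  for (j in seq_len(p)) {
    # posterior weights over the support points, given gene j's training means
    pj <- w * dnorm(xbar0[j], mean = u0, sd = 1 / sqrt(n0)) *
              dnorm(xbar1[j], mean = u1, sd = 1 / sqrt(n1))
    pj <- pj / sum(pj)
    # predictive density of the new observation under each class (unit variance)
    f0 <- sum(pj * dnorm(xnew[j], mean = u0, sd = 1))
    f1 <- sum(pj * dnorm(xnew[j], mean = u1, sd = 1))
    score <- score + log(f1) - log(f0)
  }
  as.integer(score + log(prior1 / (1 - prior1)) > 0)   # predicted class label
}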
One of the main underlying assumptions of the NPMLE-based classification methods is
that the different genes have independent expression levels. This is certainly not true in
most applications, but is similar in principle to a “naive Bayes” assumption. Developing
methods for NPMLE-based classifiers to better handle correlation in the data may be of
interest for future research.
5.9 Continuous glucose monitoring
The analysis in this section is based on blood glucose data from a study involving 137 type
1 diabetes patients; more details on the study may be found in [31, 16]. Subjects in the
study were monitored for an average of 6 months each. Throughout the course of the study,
each subject wore a continuous glucose monitoring device, built around an electrochemical
glucose biosensor. Every 5 minutes while in use, the device records (i) a raw electrical
current measurement from the sensor (denoted ISIG), which is known to be correlated with
blood glucose density, and (ii) a timestamped estimate of blood glucose density (CGM),
which is based on a proprietary algorithm for converting the available data (including the
electrical current measurements from the sensor) into blood glucose density estimates. In
addition to using the sensors, each study subject maintained a self-monitoring routine,
whereby blood glucose density was measured approximately 4 times per day from a small
blood sample extracted by fingerstick. Fingerstick measurements of blood glucose density
are considered to be more accurate (and are more invasive) than the sensor-based estimates
(e.g. CGM). During the study, the result of each fingerstick measurement was manually
entered into the continuous monitoring device at the time of measurement; algorithms for
deriving continuous sensor-based estimates of blood glucose density, such as CGM, may use
the available fingerstick measurements for calibration purposes.
In the rest of this section, we show how NPMLE-based empirical Bayes methods can
be used to improve algorithms for estimating blood glucose density using the continuous
monitoring data. The basic idea is that after formulating a statistical model relating blood
glucose density to ISIG, we allow for the possibility that the model parameters may differ for
each subject, then use a training dataset to estimate the distribution of model parameters
across subjects (i.e. estimate G0) via nonparametric maximum likelihood. This is illustrated
for two different statistical models in Sections 5.9.1–5.9.2.
Throughout the analysis below, we use fingerstick measurements as a proxy for the actual
blood glucose density values. Let FSj(t) and ISIGj(t) denote the fingerstick blood glucose density and ISIG values, respectively, for the j-th subject at time t. Recall that FSj(t) is measured, on average, once every 6 hours, while ISIGj(t) is available every 5 minutes. Let Ft denote the σ-field of information available at time t (i.e. all of the fingerstick and ISIG measurements taken before time t, plus ISIGj(t)). For each methodology, we use the first half of the available data for each subject to fit a statistical model relating ISIGj(t) to FSj(t), then estimate each value FSj(t) in the second half of the data using \widehat{FS}_j(t), an estimator based on Ft. The performance of each method is measured by the average MSE
on the test data, relative to the MSE of the proprietary estimator CGM.
5.9.1 Linear model
First we consider a basic linear regression model relating FS and ISIG,
    FSj(t) = µj + βj ISIGj(t) + σj εj(t),    (5.12)

where the εj(t) are iid N(0, 1) random variables and Θj = (µj, βj, log(σj)) ∈ R^3 are unknown, subject-specific parameters. Three ways to fit (5.12) are (i) the combined model, where Θj = Θ = (µ, β, log(σ)) for all j, i.e. all the subject-specific parameters are the same; (ii) the individual model, where Θ1, . . . , Θp are all estimated separately from the corresponding subject data; and (iii) the nonparametric mixture model, where Θj ~ G0 are iid draws from the d = 3-dimensional mixing distribution G0. For each of these methods, we took \widehat{FS}_j(t) = µ̂j + β̂j ISIGj(t), where µ̂j and β̂j are the corresponding MLEs under the combined and individual models and, under the mixture model, µ̂j = E_{GΛ}(µj | Ft) and β̂j = E_{GΛ}(βj | Ft). Results are reported in Table 5.5.
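As an illustration of the mixture-model prediction for (5.12), the following R sketch computes a subject's posterior means of µj and βj under a fitted discrete mixing distribution and uses them to predict FS from new ISIG values. The support points (mu_g, beta_g, sigma_g) with weights w_g stand for the fitted GΛ; the names are ours, and the sketch conditions on a fixed training window rather than the growing information set Ft used in the text.

## Sketch: posterior-mean prediction under the nonparametric mixture fit of (5.12).
## x, y: a subject's training ISIG and fingerstick values; xnew: new ISIG values.
predict_fs_linear <- function(x, y, xnew, mu_g, beta_g, sigma_g, w_g) {
  loglik <- sapply(seq_along(w_g), function(g) {
    sum(dnorm(y, mean = mu_g[g] + beta_g[g] * x, sd = sigma_g[g], log = TRUE))
  })
  post <- w_g * exp(loglik - max(loglik))   # posterior over support points
  post <- post / sum(post)
  mu_hat   <- sum(post * mu_g)              # E(mu_j | data)
  beta_hat <- sum(post * beta_g)            # E(beta_j | data)
  mu_hat + beta_hat * xnew                  # predicted FS at the new ISIG values
}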
5.9.2 Kalman filter
Substantial performance improvements are possible by allowing the model parameters
relating FS and ISIG to vary with time. In this section we consider the Gaussian state
space model (Kalman filter)

    FSj(ti) = αj(ti) ISIGj(ti) + σj εj(ti−1),
    αj(ti) = αj(ti−1) + τj δj(ti−1),    (5.13)

where we assume that FS along with ISIG are observed at times t1, . . . , tn and εj(t), δj(t) ~ N(0, 1) are iid. In (5.13), {αj(t)} are the state variables that evolve according to a random walk and Θj = (log(τj), log(σj)) are unknown parameters. Unlike (5.12), there is no intercept term in (5.13); dropping the intercept term has been previously justified when using state space models to analyze glucose sensor data [e.g. 16]. The parameters σj, τj control how heavily recent observations are weighted when estimating αj(t).

Similar to the analysis in Section 5.9.1, we fit (5.13) using (i) a combined model where Θj = Θ for all j; (ii) an individual model where Θ1, . . . , Θp are estimated separately; and (iii) a nonparametric mixture model, where Θj ~ G0 are iid draws from a d = 2-dimensional mixing distribution. Under (i)–(ii), σj and τj are estimated by maximum likelihood and \widehat{FS}_j(ti) = E{αj(ti) | Fti} × ISIGj(ti), where the conditional expectation is computed with respect to the Gaussian law governed by (5.13), with σ̂j and τ̂j replacing σj and τj (i.e. we use the Kalman filter). For the nonparametric mixture (iii), \widehat{FS}_j(ti) = E_{GΛ}{αj(ti) | Fti} × ISIGj(ti), where the expectation is computed with respect to the model (5.13) and the estimated mixing distribution GΛ. Results are reported in Table 5.5.
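To make the filtering step concrete, the following R sketch runs the Kalman filter for (5.13) for a single subject at fixed (σ, τ), producing the one-step-ahead estimates of FS(ti); FS is only observed at fingerstick times, so missing entries are treated as pure prediction steps. The function and argument names, and the diffuse initialization, are illustrative assumptions.

## Sketch: Kalman filter for (5.13) with time-varying observation coefficient ISIG(t_i).
## fs: fingerstick values (NA when unobserved); isig: ISIG values on the same time grid.
kalman_fs <- function(fs, isig, sigma, tau, a0 = 0, P0 = 10) {
  n <- length(isig)
  a <- a0; P <- P0
  fs_hat <- numeric(n)
  for (i in seq_len(n)) {
    P <- P + tau^2                        # predict step for the random-walk state
    fs_hat[i] <- a * isig[i]              # one-step-ahead estimate of FS(t_i)
    if (!is.na(fs[i])) {                  # update step when a fingerstick is available
      S <- isig[i]^2 * P + sigma^2        # innovation variance
      K <- P * isig[i] / S                # Kalman gain
      a <- a + K * (fs[i] - a * isig[i])  # filtered state E{alpha(t_i) | F_{t_i}}
      P <- (1 - K * isig[i]) * P
    }
  }
  fs_hat
}

Under the mixture model (iii), one would run this recursion at each support point of GΛ and average the resulting predictions with the posterior weights of the support points given the subject's past fingerstick data, which yields E_{GΛ}{αj(ti) | Fti} × ISIGj(ti).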
5.9.3 Comments on results
From Table 5.5, it is evident that the NPMLE mixture approach outperforms the individual
and combined methods for both the linear model and the Kalman filter/state space model.
The Kalman filter methods perform substantially better than the linear model, highlighting
the importance of time-varying parameters (scientifically, this is justified because the
sensitivity of the glucose sensor is known to change over time). Note that all of the relative
MSE values in Table 5.5 are greater than 1, indicating that CGM still outperforms all of
the methods considered here. Somewhat more ad hoc methods for estimating blood glucose
density that do outperform CGM are described in [16]; these methods (and CGM) leverage
additional data available to the continuous monitoring system, which is not described here
for the sake of simplicity. The methods in [16] are somewhat similar to the “combined”
Kalman filtering method from Section 5.9.2, where Θj “ Θ for all j; it would be interesting
to see if the performance of these methods could be further improved by using NPMLE
ideas.
5.10 Discussion
We have proposed a flexible, practical approach to fitting general multivariate mixing
distributions with NPMLEs and illustrated the effectiveness of this approach through
several real data examples. Theoretically, we proved that the support set of the NPMLE
is the convex hull of MLEs when the likelihood F0 comes from a class of elliptical
unimodal distributions. We believe that this approach may be attractive for many problems
where mixture models and empirical Bayes methods are relevant, offering both effective
performance and computational simplicity.
Table 5.1: Comparison of different NPMLE algorithms. Mean values (standard deviation in parentheses) reported from 100 independent datasets; p = 1000 throughout the simulations. Mixing distribution 1 has constant σj; mixing distribution 2 has correlated µj and σj.

                                                       TSE             Δ(log-lik) ×10^4   Time (secs.)
Mixing dist. 1
  EM (bivariate)           (q1, q2) = (30, 30)         130.5 (42.6)    0 (0)              9
                           (q1, q2) = (50, 50)         130.4 (42.7)    0 (0)              33
                           (q1, q2) = (100, 100)       130.4 (42.6)    0 (0)              136
  Interior point (Rmosek)  (q1, q2) = (30, 30)         130.7 (42.6)    6 (1)              8
                           (q1, q2) = (50, 50)         130.5 (42.9)    9 (1)              20
                           (q1, q2) = (100, 100)       130.6 (42.8)    11 (1)             80
  Frank-Wolfe              (q1, q2) = (30, 30)         147.3 (45.1)    -234 (130)         5
                           (q1, q2) = (50, 50)         147.0 (45.9)    -238 (134)         14
                           (q1, q2) = (100, 100)       146.2 (45.4)    -238 (128)         55

Mixing dist. 1 (univariate; q = 300; assume known σj)
  EM                                                   124.4 (41.6)    0 (0)              1
  Interior point (Rmosek)                              124.3 (42.1)    6 (1)              3
  Interior point (REBayes)                             124.3 (42.1)    6 (1)              1
  Frank-Wolfe                                          126.1 (41.8)    -4 (4)             1

Mixing dist. 2
  EM                       (q1, q2) = (30, 30)         54.0 (28.4)     0 (0)              9
                           (q1, q2) = (50, 50)         54.0 (28.9)     0 (0)              34
                           (q1, q2) = (100, 100)       53.9 (28.8)     0 (0)              141
  Interior point (Rmosek)  (q1, q2) = (30, 30)         54.3 (28.8)     5 (1)              8
                           (q1, q2) = (50, 50)         54.2 (29.1)     8 (1)              20
                           (q1, q2) = (100, 100)       54.3 (29.1)     10 (1)             82
  Frank-Wolfe              (q1, q2) = (30, 30)         82.2 (39.3)     -372 (217)         5
                           (q1, q2) = (50, 50)         83.1 (36.4)     -402 (232)         14
                           (q1, q2) = (100, 100)       82.0 (37.7)     -396 (240)         56
Table 5.2: Mean TSE for various estimators of µ ∈ R^p based on 100 simulated datasets; p = 1000. (q1, q2) indicates the grid points used to fit GΛ.

Method                                     Mixing dist. 1   Mixing dist. 2
Fixed-Θj MLE                               997.0 (48.2)     1059.3 (57.9)
Soft-Thresholding                          826.2 (50.0)     793.7 (46.0)
James-Stein                                859.7 (43.6)     935.2 (53.0)
SURE                                       859.7 (43.6)     880.7 (48.1)
Univariate NPMLE (q = 300)                 170.7 (47.0)     285.4 (63.5)
Bivariate NPMLE ((q1, q2) = (100, 100))    130.4 (42.6)     53.9 (28.8)
Table 5.3: Baseball data. TSE relative to the naive estimator. The minimum error for each analysis is marked with an asterisk.

Method             All     Non-Pitchers   Pitchers
Naive              1       1              1
Grand mean         0.85    0.38           0.13
James-Stein        0.53    0.36           0.16
GMLE               0.30    0.26           0.14
SURE               0.41    0.26           0.08*
Binomial mixture   0.59    0.31           0.16
NPMLE              0.27*   0.25*          0.13
Table 5.4: Microarray data. Number of misclassification errors on test data.

Dataset   Outcome     n_test   2d-NPMLE   1d-NPMLE   NP EBayes   Regularized LDA   Logistic lasso
Breast    Response    100      15         36         47          30                18
Breast    ER status   100      19         40         39          11                11
Myeloma   OS          214      30         55         100         97                27
Myeloma   EFS         214      34         76         100         63                32
Table 5.5: Blood glucose data. MSE relative to CGM.

             Linear model                     Kalman filter
Combined   Individual   NPMLE      Combined   Individual   NPMLE
  1.56        1.54       1.51        1.05        1.07       1.03
Bibliography
[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat., 40:2452–2482, 2012.
[2] Anestis Antoniadis. Comments on: `1-penalization for mixture regression models. Test, 19(2):257–258, 2010.
[3] MOSEK ApS. MOSEK Rmosek Package. Release 8.0.0.46, 2015. URL https://mosek.com/resources/doc/.
[4] Pierre C Bellec, Guillaume Lecue, and Alexandre B Tsybakov. Slope meets lasso: improved oracle bounds and optimality. arXiv preprint arXiv:1605.08651, 2016.
[5] Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
[6] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
[7] D. Bohning. A review of reliable maximum likelihood algorithms for semiparametric mixture models. J. Stat. Plan. Infer., 47:5–28, 1995.
[8] D. Bohning, P. Schlattmann, and B. G. Lindsay. Computer-assisted analysis of mixtures (CA MAN): Statistical algorithms. Biometrics, 48:283–303, 1992.
[9] Patrick Breheny and Jian Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics, 5:232–253, 2011.
[10] L. D. Brown. In-season prediction of batting averages: A field test of empirical Bayes and Bayes methodologies. Ann. Appl. Stat., 2:113–152, 2008.
[11] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313–2351, 2007.
[12] Emmanuel J Candes and Terence Tao. Decoding by linear programming. Information Theory, IEEE Transactions on, 51(12):4203–4215, 2005.
[13] A. DasGupta. Asymptotic Theory of Statistics and Probability. Springer, 2008.
[14] Abhirup Datta and Hui Zou. Cocolasso for high-dimensional error-in-variables regression. arXiv preprint arXiv:1510.07123, 2015.
[15] L. H. Dicker and S. D. Zhao. High-dimensional classification via nonparametric empirical Bayes and maximum likelihood inference. Biometrika, 103:21–34, 2016.
[16] L. H. Dicker, T. Sun, C.-H. Zhang, D. B. Keenan, and L. Shepp. Continuous blood glucose monitoring: A Bayes-hidden Markov approach. Stat. Sinica, 23:1595–1627, 2013.
[17] D. L. Donoho and G. Reeves. Achieving Bayes MMSE performance in the sparse signal + Gaussian white noise model when the noise level is unknown. In IEEE Int. Symp. Inf. Theory, pages 101–105, 2013.
[18] B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, 2010.
[19] B. Efron and C. Morris. Data analysis using Stein's estimator and its generalizations. J. Am. Stat. Assoc., 70:311–319, 1975.
[20] Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani, et al. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
[21] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[22] Jianqing Fan, Han Liu, Qiang Sun, and Tong Zhang. Tac for sparse learning: Simultaneous control of algorithmic complexity and statistical error. arXiv preprint arXiv:1507.01037, 2015.
[23] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Ann. Stat., 1:209–230, 1973.
[24] M. Frank and P. Wolfe. An algorithm for quadratic programming. Nav. Res. Log., 3:95–110, 1956.
[25] Jerome Friedman, Trevor Hastie, Holger Hofling, Robert Tibshirani, et al. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.
[26] S. Ghosal and A. W. Van der Vaart. Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Stat., 29:1233–1263, 2001.
[27] E. Greenshtein and J. Park. Application of non parametric empirical Bayes estimation to high dimensional classification. J. Mach. Learn. Res., 10:1687–1704, 2009.
[28] J. Gu and R. Koenker. On a problem of Robbins. Int. Stat. Rev., 84:224–244, 2016.
[29] J. Gu and R. Koenker. Empirical Bayesball remixed: Empirical Bayes methods for longitudinal data. J. Appl. Econom., 2016. To appear.
[30] J. Gu and R. Koenker. Unobserved heterogeneity in income dynamics: An empirical Bayes perspective. J. Bus. Econ. Stat., 2016. To appear.
[31] I. B. Hirsch, J. Abelseth, B. W. Bode, J. S. Fischer, F. R. Kaufman, J. Mastrototaro, C. G. Parkin, H. A. Wolpert, and B. A. Buckingham. Sensor-augmented insulin pump therapy: Results of the first randomized treat-to-target study. Diabetes Technol. Ther., 10:377–383, 2008.
[32] Jian Huang and Cun-Hui Zhang. Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. Journal of Machine Learning Research, 13:1809–1834, 2012.
[33] Junzhou Huang and Tong Zhang. The benefit of group sparsity. The Annals of Statistics, 38(4):1978–2004, 2010.
[34] P. J. Huber and E. M. Ronchetti. Robust Statistics, pages 172–175. Wiley, second edition, 2009.
[35] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. ICML 2013, 28:427–435, 2013.
[36] W. Jiang and C.-H. Zhang. General maximum likelihood empirical Bayes estimation of normal means. Ann. Stat., 37:1647–1684, 2009.
[37] W. Jiang and C.-H. Zhang. Empirical Bayes in-season prediction of baseball batting averages. In Borrowing Strength: Theory Powering Applications – A Festschrift for Lawrence D. Brown, pages 263–273. Institute of Mathematical Statistics, 2010.
[38] J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Stat., 27:887–906, 1956.
[39] R. Koenker and J. Gu. REBayes: An R package for empirical Bayes mixture methods, 2016.
[40] R. Koenker and I. Mizera. Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. J. Am. Stat. Assoc., 109:674–685, 2014.
[41] N. Laird. Nonparametric maximum likelihood estimation of a mixing distribution. J. Am. Stat. Assoc., 73:805–811, 1978.
[42] Guillaume Lecue and Shahar Mendelson. Sparse recovery under weak moment assumptions. arXiv preprint arXiv:1401.2188, 2014.
[43] B. G. Lindsay. Mixture Models: Theory, Geometry, and Applications. IMS, 1995.
[44] Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. In Advances in Neural Information Processing Systems, pages 2726–2734, 2011.
[45] Po-Ling Loh and Martin J Wainwright. Support recovery without incoherence: A case for nonconvex regularization. arXiv preprint arXiv:1412.5632, 2014.
[46] Po-Ling Loh and Martin J Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616, 2015.
[47] Karim Lounici, Massimiliano Pontil, Sara Van De Geer, and Alexandre B Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, pages 2164–2204, 2011.
[48] Q. Mai, H. Zou, and M. Yuan. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 99:29–42, 2012.
[49] MAQC Consortium. The microarray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol., 28:827–838, 2010.
[50] G. McLachlan and D. Peel. Finite Mixture Models. John Wiley & Sons, 2004.
[51] Nicolai Meinshausen and Peter Buhlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462, 2006.
[52] Ritwik Mitra and Cun-Hui Zhang. The benefit of group sparsity in group inference with de-biased scaled group lasso. Electronic Journal of Statistics, 10(2):1829–1873, 2016.
[53] O. Muralidharan. An empirical Bayes mixture method for effect size and false discovery rate estimation. Ann. Appl. Stat., 4:422–438, 2010.
[54] Yuval Nardi and Alessandro Rinaldo. On the asymptotic properties of the group lasso estimator for linear models. Electronic Journal of Statistics, 2:605–633, 2008.
[55] Sahand Negahban, Pradeep K Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
[56] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science, pages 538–557, 2012.
[57] Roberto Imbuzeiro Oliveira. The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv preprint arXiv:1312.2903, 2013.
[58] Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis - Institute of Mathematics and its Applications, 20(3):389–404, 2000.
[59] Michael R Osborne, Brett Presnell, and Berwin A Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000.
[60] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.
[61] Stephen Reid, Robert Tibshirani, and Jerome Friedman. A study of error variance estimation in lasso regression. Statistica Sinica, 26:35–67, 2016.
[62] H. E. Robbins. A generalization of the method of maximum likelihood: Estimating a mixing distribution (abstract). Ann. Math. Stat., 21:314–315, 1950.
[63] H. E. Robbins. The empirical Bayes approach to statistical decision problems. In Proc. Third Berkeley Symp. on Math. Statist. and Prob., volume 1, pages 157–163, 1956.
[64] Mathieu Rosenbaum, Alexandre B Tsybakov, et al. Sparse recovery under matrix uncertainty. The Annals of Statistics, 38(5):2620–2651, 2010.
[65] Mathieu Rosenbaum, Alexandre B Tsybakov, et al. Improved matrix uncertainty selector. In From Probability to Statistics and Back: High-Dimensional Models and Processes – A Festschrift in Honor of Jon A. Wellner, pages 276–290. Institute of Mathematical Statistics, 2013.
[66] Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. Information Theory, IEEE Transactions on, 59(6):3434–3447, 2013.
[67] Nicolas Stadler, Peter Buhlmann, and Sara van de Geer. `1-penalization for mixture regression models. Test, 19(2):209–256, 2010.
[68] T. Sun and C.-H. Zhang. Comments on: `1-penalization for mixture regression models. Test, 19(2):270–275, 2010.
[69] Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, page ass043, 2012.
[70] Tingni Sun and Cun-Hui Zhang. Sparse matrix inversion with scaled lasso. The Journal of Machine Learning Research, 14(1):3385–3418, 2013.
[71] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[72] Joel Tropp et al. Just relax: Convex programming methods for identifying sparse signals in noise. Information Theory, IEEE Transactions on, 52(3):1030–1051, 2006.
[73] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[74] Sara van de Geer. The deterministic lasso. 2007.
[75] Sara van de Geer. The lasso with within group structure. In Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jureckova, pages 235–244. Institute of Mathematical Statistics, 2010.
[76] Sara van de Geer and Alan Muro. On higher order isotropy conditions and lower bounds for sparse quadratic forms. Electronic Journal of Statistics, 8(2):3031–3061, 2014.
[77] Sara A van de Geer and Peter Buhlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
[78] Aad W van der Vaart and Jon A Wellner. Weak Convergence. Springer, 1996.
[79] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using `1-constrained quadratic programming (lasso). Information Theory, IEEE Transactions on, 55(5):2183–2202, 2009.
[80] Zhaoran Wang, Han Liu, and Tong Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Annals of Statistics, 42(6):2164, 2014.
[81] X. Xie, S. C. Kou, and L. D. Brown. SURE estimates for a heteroscedastic hierarchical model. J. Am. Stat. Assoc., 107:1465–1479, 2012.
[82] Fei Ye and Cun-Hui Zhang. Rate minimaxity of the lasso and Dantzig selector for the `q loss in `r balls. Journal of Machine Learning Research, 11(Dec):3519–3540, 2010.
[83] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[84] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, pages 894–942, 2010.
[85] Cun-Hui Zhang and Jian Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, pages 1567–1594, 2008.
[86] Tong Zhang. Some sharp performance bounds for least squares regression with `1 regularization. The Annals of Statistics, 37(5A):2109–2144, 2009.
[87] Tong Zhang. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1087–1107, 2010.
[88] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
[89] Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509, 2008.