conquer: Convolution Smoothed Quantile Regression
Wenxin Zhou
Joint work with Xuming He, Lan Wang, Kean Ming Tan & Xiaoou Pan
2021 WNAR Conference
‣ The idea of median regression predates the least squares method by about 50 years.
‣ Roger Joseph Boscovich (1760), Pierre-Simon Laplace (1789), F. Y. Edgeworth (1887).
‣ The method of least squares (the $\ell_2$-method) has significant computational advantages over the $\ell_1$-method (minimization of absolute errors) advocated by Boscovich, Laplace and others, even though the latter is more robust.
‣ Regression quantiles: Koenker and Bassett, Jr. (1978).
Computational Development
‣ Simplex-based algorithms: Barrodale & Roberts (1974), Koenker & d'Orey (1987). Slow in larger samples; worst-case analysis suggests the number of iterations may increase exponentially with the sample size.
‣ Interior point method: the Newton-Frisch algorithm (Portnoy & Koenker, 1997) has computational complexity $\mathcal{O}(n^{1+a} p^3 \log n)$ for $0 < a < 1/2$, conjectured to be improvable to $\mathcal{O}(n p^3 \log^2(n))$ (Mizuno-Todd-Ye conjecture).
‣ Preprocessing (Portnoy & Koenker, 1997): improved complexity $\mathcal{O}((np)^{2(1+a)/3} p^3 \log n + np)$, improvable to $\mathcal{O}(n^{2/3} p^3 \log^2(n) + np)$.
‣ R "quantreg", MATLAB, SAS, etc.
Quantile regression versus mean regression
‣ Mean regression: $y_i = m(x_i) + \varepsilon_i$, $\mathbb{E}(\varepsilon_i \mid x_i) = 0$.
‣ Quantile regression: $\mathbb{P}\{y \le Q(\tau, x) \mid x\} = \tau$.
No strict distinction between 'signal' and 'noise'. Object of interest: the conditional distribution of $y \mid x$. Contains richer information than the conditional mean.
Model and Assumptions
‣ Goal: learn the effect of a $p \times 1$ vector of covariates $x = (x_1, \ldots, x_p)^\top$ (with $x_1 \equiv 1$) on the entire distribution of $y$.
‣ Conditional quantile model: $Q(\tau, x) = F_{y|x}^{-1}(\tau) \approx x^\top \beta^*(\tau)$, $\tau \in [\tau_L, \tau_U] \subseteq (0,1)$.
‣ $(y_1, x_1), \ldots, (y_n, x_n) \stackrel{\text{iid}}{\sim} (y, x)$.
Given $\tau \in [\tau_L, \tau_U]$, $(y, x)$ admits the characterization $y \approx x^\top \beta^*(\tau) + \varepsilon(\tau)$, where $\mathbb{P}\{\varepsilon(\tau) \le 0 \mid x\} = \tau$. Let $f_{\varepsilon(\tau)}(\cdot \mid x)$ be the conditional density function of $\varepsilon(\tau)$ given $x$.
QR-series Approximation for NP Model
‣ QR-series approximation to $x \mapsto Q(\tau, x)$: fix $\tau$, let $x \mapsto Z(x) = (Z_1(x), \ldots, Z_m(x))^\top$ be a vector of series approximating functions of dimension $m = m_n$.
• B-splines (or regression splines)
• Polynomials
• Fourier series
• Compactly supported wavelets
‣ QR-series approximation error $R(\tau, x) = Q(\tau, x) - Z(x)^\top \beta^*(\tau)$ vanishes asymptotically under appropriate conditions when $m = m_n \to \infty$ as $n \to \infty$ (Belloni et al., 2019).
Quantile Regression
Given $\tau \in (0,1)$, the standard QR estimator is
$$\widehat{\beta} = \widehat{\beta}(\tau) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \rho_\tau(y_i - x_i^\top \beta),$$
where $\rho_\tau(u) = \{\tau - \mathbb{1}(u < 0)\}\, u$ is the check function.
• robustness against outliers in the response, especially in the case of median regression
• ability to capture heterogeneity in the set of important predictors at different quantile levels of the response distribution caused by, e.g., heteroscedastic variance
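For concreteness, a minimal numpy sketch (not part of the slides) of the check function and the resulting QR objective; minimizing it requires a linear-programming or other specialized solver, which is not shown here.

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function rho_tau(u) = {tau - 1(u < 0)} * u."""
    return (tau - (u < 0)) * u

def qr_objective(beta, X, y, tau):
    """Standard QR objective sum_i rho_tau(y_i - x_i' beta) for a candidate beta."""
    return np.sum(check_loss(y - X @ beta, tau))
```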
Theoretical Development
‣ Consistency: Bassett & Koenker (1986), Zhao, Rao & Chen (1993), El Bantli & Hallin (1999), etc.
‣ Rate of convergence: Ruppert & Carroll (1980), Pollard (1991), Hjort & Pollard (1993), Knight (1998), etc.
‣ Bahadur representation & Normal approximation: Jureckova & Sen (1984), Portnoy (1984), Koenker & Portnoy (1987), Portnoy & Koenker (1989), Gutenbrunner & Jureckova (1992), Hendricks & Koenker (1991), He & Shao (1996, 2000), Arcones (1996), Koenker & Machado (1999), Koenker & Xiao (2002).
Classical asymptotics: $n \to \infty$ and $p$ is fixed.
Challenges in QR
• Lack of strong convexity: the quantile loss is piecewise linear and its "curvature energy" is concentrated at a single point. This is substantially different from other popular loss functions, e.g. $\ell_2$, logistic and Huber, or even Tukey and Hampel, which are at least locally strongly convex.
• Lack of smoothness: the quantile loss is not everywhere differentiable. Theoretically, this leads to an error term of order $\mathcal{O}_{\mathbb{P}}(n^{-1/4})$ or $\mathcal{O}_{\mathbb{P}}\{(p^3/n)^{1/4}\}$ in the Bahadur representation.
‣ Welsh (1989) shows that $p^3 \log^2(n) = o(n)$ suffices for normal approximation (fixed design).
‣ Huber regression requires $p^2 = o(n)$.
Smoothed Estimating Equation (SEE) Approach
- Population loss $Q(\beta) = \mathbb{E}_{(y,x) \sim P}\, \rho_\tau(y - x^\top \beta)$, and $\beta^* = \beta^*(\tau) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} Q(\beta)$.
- If $F_{\varepsilon(\tau)|x}$ is continuously differentiable, $Q$ is twice differentiable and strongly convex at least in a neighborhood of $\beta^*$. Moreover, $\beta^*$ satisfies the first-order condition:
$$\nabla Q(\beta)\big|_{\beta = \beta^*} = \mathbb{E}\big[\{\mathbb{1}(y < x^\top \beta) - \tau\}\, x\big]\big|_{\beta = \beta^*} = 0.$$
- Sample analog:
$$\frac{1}{n} \sum_{i=1}^n \{\mathbb{1}(y_i - x_i^\top \beta < 0) - \tau\}\, x_i = 0.$$
The QR estimator $\widehat{\beta}$ solves this equation approximately.
- SEE approach (Whang, 2006; Kaplan & Sun, 2017):
$$\frac{1}{n} \sum_{i=1}^n \{G(-r_i(\beta)/h) - \tau\}\, x_i = 0,$$
where $r_i(\beta) = y_i - x_i^\top \beta$, $G$ is a smooth function and $h > 0$ is the bandwidth.
- Horowitz's method (Horowitz, 1998): smooth the criterion function by replacing the indicator in the check function with a kernel counterpart:
$$\ell_h^H(u) = u\{\tau - G(-u/h)\}.$$
- Horowitz's smoothed check function is not convex!
M-estimation Viewpoint
- Smoothed loss function
$$\widehat{Q}_h(\beta) = \frac{1}{n} \sum_{i=1}^n \underbrace{(\rho_\tau * K_h)}_{=:\, \ell_h}(y_i - x_i^\top \beta),$$
where $K$ is a kernel function, $K_h(u) = (1/h) K(u/h)$, and
$$\ell_h(u) = (\rho_\tau * K_h)(u) = \int_{-\infty}^{\infty} \rho_\tau(v) K_h(v - u)\, dv.$$
- Convolution smoothed QR (conquer):
$$\widehat{\beta}_h = \widehat{\beta}_h(\tau) \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \widehat{Q}_h(\beta).$$
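For intuition, here is a minimal numpy sketch (an illustration, not the authors' implementation) of the smoothed loss under a Gaussian kernel, for which the convolution has the closed form $\ell_h(u) = u\{\tau - \Phi(-u/h)\} + h\,\phi(u/h)$, with $\Phi$ and $\phi$ the standard normal CDF and density.

```python
import numpy as np
from scipy.stats import norm

def conquer_loss_gaussian(beta, X, y, tau, h):
    """Empirical conquer loss (1/n) sum_i ell_h(y_i - x_i' beta) with a Gaussian kernel.

    Uses the closed form ell_h(u) = u * (tau - Phi(-u/h)) + h * phi(u/h), obtained by
    integrating the check function against a N(0, h^2) density.
    """
    r = y - X @ beta                                  # residuals r_i(beta)
    return np.mean(r * (tau - norm.cdf(-r / h)) + h * norm.pdf(r / h))
```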
- Any optimum $\widehat{\beta}_h$ satisfies the FOC
$$\nabla \widehat{Q}_h(\widehat{\beta}_h) = \frac{1}{n} \sum_{i=1}^n \{\bar{K}(-r_i(\widehat{\beta}_h)/h) - \tau\}\, x_i = 0,$$
where $\bar{K}(u) = \int_{-\infty}^{u} K(t)\, dt$.
- Convexity:
$$\nabla^2 \widehat{Q}_h(\beta) = \frac{1}{n} \sum_{i=1}^n K_h(y_i - x_i^\top \beta) \cdot x_i x_i^\top.$$
Provided that $K$ is non-negative, $\widehat{Q}_h$ is convex and hence any minimizer satisfies the first-order moment condition.
- Fixed-$p$ asymptotics: Fernandes, Guerre & Horta (2021).
- Growing-$p$ (non)asymptotics: He et al. (2020).
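Under the same Gaussian-kernel assumption as above ($\bar{K} = \Phi$, $K_h(u) = \phi(u/h)/h$), the gradient and Hessian take a simple form; a short numpy sketch:

```python
import numpy as np
from scipy.stats import norm

def conquer_grad_gaussian(beta, X, y, tau, h):
    """Gradient (1/n) sum_i {Kbar(-r_i/h) - tau} x_i with Kbar = Phi."""
    r = y - X @ beta
    return X.T @ (norm.cdf(-r / h) - tau) / len(y)

def conquer_hess_gaussian(beta, X, y, tau, h):
    """Hessian (1/n) sum_i K_h(r_i) x_i x_i^T; K_h >= 0 makes the loss convex."""
    r = y - X @ beta
    w = norm.pdf(r / h) / h
    return (X * w[:, None]).T @ X / len(y)
```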
Convolution versus Deconvolution
Adding noise $\{u_i\}$ to the response leads to noisy QR
$$\min_\beta \frac{1}{n} \sum_{i=1}^n \rho_\tau(y_i + u_i - x_i^\top \beta).$$
Since the noise distribution can be specified, consider
$$\min_\beta \frac{1}{n} \sum_{i=1}^n \mathbb{E}_u\{\rho_\tau(y_i + u_i - x_i^\top \beta)\}.$$
‣ Gaussian kernel/noise: $u_i \sim N(0, h^2)$.
‣ Uniform kernel/noise: $u_i \sim \mathrm{Unif}(-h, h)$.
‣ Laplacian kernel/noise: $u_i \sim \mathrm{Laplace}(0, h)$.
Computational Methods
‣ Gradient descent (GD): starting at iteration 0 with an initial estimate $\widehat{\beta}^{0}$, at iteration $t = 0, 1, 2, \ldots$, computes
$$\widehat{\beta}^{t+1} = \widehat{\beta}^{t} - \eta_t \cdot \nabla \widehat{Q}_h(\widehat{\beta}^{t}),$$
where $\nabla \widehat{Q}_h(\beta) = (1/n) \sum_i \{\bar{K}((x_i^\top \beta - y_i)/h) - \tau\}\, x_i$.
‣ Barzilai-Borwein step size (Barzilai & Borwein, 1988): for $t = 1, 2, \ldots,$ define
$$\delta_t = \widehat{\beta}^{t} - \widehat{\beta}^{t-1}, \qquad g_t = \nabla \widehat{Q}_h(\widehat{\beta}^{t}) - \nabla \widehat{Q}_h(\widehat{\beta}^{t-1}).$$
The BB step sizes are $\eta_{1,t} = \langle \delta_t, \delta_t \rangle / \langle \delta_t, g_t \rangle$ and $\eta_{2,t} = \langle \delta_t, g_t \rangle / \langle g_t, g_t \rangle$.
‣ As $\tau \approx 0$ or $1$, the Hessian
$$\nabla^2 \widehat{Q}_h(\beta) = \frac{1}{n} \sum_{i=1}^n K_h(y_i - x_i^\top \beta) \cdot x_i x_i^\top$$
becomes more ill-conditioned. The step sizes computed in GD-BB may sometimes fluctuate drastically, causing instability of the algorithm. To stabilize the algorithm, we take
$$\eta_t = \min\{\eta_{1,t}, \eta_{2,t}, C\}, \qquad t = 1, 2, \ldots,$$
for example, $C = 10$ (a sketch of the full GD-BB loop follows below).
‣ Scale the covariate inputs to have zero mean and/or unit variance before applying the GD-BB method.
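A minimal numpy sketch of the GD-BB loop with capped step sizes (an illustration, not the conquer package's implementation; the default bandwidth, initialization and stopping rule below are placeholder assumptions):

```python
import numpy as np
from scipy.stats import norm

def conquer_gd_bb(X, y, tau=0.5, h=None, beta0=None, max_iter=500, tol=1e-8, C=10.0):
    """Gradient descent with capped Barzilai-Borwein step sizes for Gaussian-kernel conquer."""
    n, p = X.shape
    if h is None:
        h = max(0.05, ((p + np.log(n)) / n) ** 0.4)     # heuristic bandwidth (assumption)
    def grad(beta):
        r = y - X @ beta
        return X.T @ (norm.cdf(-r / h) - tau) / n
    beta_old = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float)
    g_old = grad(beta_old)
    beta = beta_old - g_old                             # first step: unit step size
    for _ in range(max_iter):
        g = grad(beta)
        delta, gdiff = beta - beta_old, g - g_old
        denom1, denom2 = delta @ gdiff, gdiff @ gdiff
        eta1 = delta @ delta / denom1 if denom1 > 0 else C
        eta2 = denom1 / denom2 if denom2 > 0 else C
        eta = min(eta1, eta2, C)                        # capped BB step size
        beta_old, g_old = beta, g
        beta = beta - eta * g
        if np.linalg.norm(beta - beta_old) < tol:
            break
    return beta
```

Per the slides, standardizing the columns of X beforehand helps keep the Hessian well-conditioned.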
Initialization via Expectile Regression
‣ Asymmetric quadratic loss (Newey & Powell, 1987):
$$e_\tau(u) = |\tau - \mathbb{1}(u < 0)| \cdot u^2 / 2.$$
‣ Given a univariate random variable $Z$ with $\mathbb{E}|Z| < \infty$, the parameter
$$e_\tau = \operatorname*{arg\,min}_{u \in \mathbb{R}} \mathbb{E}\{e_\tau(Z - u) - e_\tau(Z)\}$$
is called the $\tau$-expectile (Newey & Powell, 1987) or Efron's $\omega$-mean with $\omega = \tau/(1 - \tau)$ (Efron, 1991).
Robustified Expectile Regression (retire)
‣ Robustified asymmetric quadratic loss:
$$r_c(u) = |\tau - \mathbb{1}(u < 0)| \cdot H_c(u),$$
where $H_c$ ($c > 0$) is the Huber loss
$$H_c(u) = 0.5 u^2 \mathbb{1}(|u| \le c) + (c|u| - c^2/2)\, \mathbb{1}(|u| > c).$$
‣ We use the retire estimator as an initial estimate:
$$\widehat{\beta}^{0}_{c} \in \operatorname*{arg\,min}_{\beta} \frac{1}{n} \sum_{i=1}^n r_c(y_i - x_i^\top \beta).$$
‣ When $\tau = 1/2$, this becomes Huber's $M$-estimator.
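A short numpy sketch of this robustified loss (an illustration; the Huber constant c = 1.345 below is a conventional placeholder, not the authors' tuning rule):

```python
import numpy as np

def retire_loss(u, tau=0.5, c=1.345):
    """Robustified asymmetric quadratic loss r_c(u) = |tau - 1(u < 0)| * H_c(u)."""
    huber = np.where(np.abs(u) <= c, 0.5 * u**2, c * np.abs(u) - 0.5 * c**2)
    return np.abs(tau - (u < 0)) * huber
```

Minimizing np.mean(retire_loss(y - X @ beta, tau, c)) over beta gives the initial estimate used to warm-start conquer.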
Bootstrap Inference with Conquer
Let $\{w_i\}_{i=1}^n$ be iid with $\mathbb{E}(w_i) = 1$ and $\mathrm{Var}(w_i) = 1$. The bootstrapped conquer estimator is defined as
$$\widehat{\beta}^{\flat} \in \operatorname*{arg\,min}_{\beta} \widehat{Q}^{\flat}_h(\beta), \qquad \widehat{Q}^{\flat}_h(\beta) = \frac{1}{n} \sum_{i=1}^n w_i\, \ell_h(y_i - x_i^\top \beta).$$
(i) $w_i \sim \mathcal{N}(1, 1)$ (negative weights break the convexity)
(ii) $w_i \sim \mathrm{Exp}(1)$
(iii) $w_i \sim 2\,\mathrm{Ber}(0.5)$ (recommended for large samples)
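A minimal sketch of the three multiplier-weight schemes and the resulting bootstrap objective (illustrative only, with the Gaussian-kernel loss assumed earlier):

```python
import numpy as np
from scipy.stats import norm

def bootstrap_weights(n, scheme="exp", rng=None):
    """Multiplier weights with mean 1 and variance 1."""
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "normal":       # (i) N(1, 1); negative weights can break convexity
        return rng.normal(1.0, 1.0, n)
    if scheme == "exp":          # (ii) Exp(1)
        return rng.exponential(1.0, n)
    if scheme == "rademacher":   # (iii) 2 * Bernoulli(0.5), i.e. weights in {0, 2}
        return 2.0 * rng.binomial(1, 0.5, n)
    raise ValueError(f"unknown scheme: {scheme}")

def bootstrap_conquer_loss(beta, X, y, w, tau, h):
    """Bootstrapped objective (1/n) sum_i w_i * ell_h(y_i - x_i' beta), Gaussian kernel."""
    r = y - X @ beta
    ell = r * (tau - norm.cdf(-r / h)) + h * norm.pdf(r / h)
    return np.mean(w * ell)
```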
Normal Distribution Calibrated Inference
‣ As shown in He et al. (2020), for each $j = 1, \ldots, p$,
$$n^{1/2} \sigma_j^{-1} (\widehat{\beta}_h - \beta^*)_j \xrightarrow{d} \mathcal{N}(0, 1),$$
where $\sigma_j^2$ is the $j$-th diagonal entry of
$$H^{-1}\, \mathbb{E}\big[\{\bar{K}(-\varepsilon/h) - \tau\}^2 x x^\top\big]\, H^{-1}$$
and $H = \mathbb{E}\{f_\varepsilon(0 \mid x)\, x x^\top\}$.
‣ Kernel-type matrix estimators:
$$\widehat{H} = \frac{1}{nh} \sum_{i=1}^n \phi(\widehat{\varepsilon}_i / h)\, x_i x_i^\top, \qquad \widehat{\Sigma}(\tau) = \frac{1}{n} \sum_{i=1}^n \{\bar{K}(-\widehat{\varepsilon}_i / h) - \tau\}^2 x_i x_i^\top.$$
Here we use the same bandwidth $h$ as for the estimator $\widehat{\beta}_h$.
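A plug-in sketch of the resulting standard errors under the Gaussian-kernel assumption used in the earlier code (illustrative; not the package's estimator):

```python
import numpy as np
from scipy.stats import norm

def conquer_std_errors(X, y, beta_hat, tau, h):
    """Sandwich standard errors: sqrt of diag(H^-1 Sigma H^-1 / n) with fitted residuals."""
    n = len(y)
    res = y - X @ beta_hat                                   # residuals at beta_hat
    H_hat = (X * (norm.pdf(res / h) / h)[:, None]).T @ X / n
    score = (norm.cdf(-res / h) - tau)[:, None] * X
    Sigma_hat = score.T @ score / n
    H_inv = np.linalg.inv(H_hat)
    cov = H_inv @ Sigma_hat @ H_inv / n                      # approximate Cov(beta_hat)
    return np.sqrt(np.diag(cov))
```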
Penalized Conquer in High Dimensions
$\beta^* \in \mathbb{R}^p$ is sparse: $\|\beta^*\|_0 \le s \ll n \ll p$.
- $\ell_1$-penalized QR (Belloni & Chernozhukov, 2011):
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \rho_\tau(y_i - x_i^\top \beta) + \lambda \|\beta\|_1,$$
where $\lambda \asymp \sqrt{\tau(1-\tau) \cdot \log(p)/n}$.
- R packages: "quantreg", "FHDQR", "rqPen", etc.
• $\ell_1$ penalty: introduces non-negligible estimation bias.
Concave Regularization
Smoothly Clipped Absolute Deviation: Fan & Li (2001)
Minimax Concave Penalty: C.-H. Zhang (2010)
Iteratively Reweighted $\ell_1$-conquer
Starting with an initial estimate $\widehat{\beta}^{(0)} = (\widehat{\beta}^{(0)}_1, \ldots, \widehat{\beta}^{(0)}_p)^\top$, at iteration $t = 1, \ldots, T$, solve the convex optimization problem
$$\min_\beta\; \underbrace{\frac{1}{n} \sum_{i=1}^n \ell_h(y_i - x_i^\top \beta)}_{\widehat{Q}_h(\beta)} \;+\; \underbrace{\sum_{j=2}^p q_\lambda'\big(|\widehat{\beta}^{(t-1)}_j|\big)\, |\beta_j|}_{\|\lambda^{(t-1)} \circ\, \beta_{-}\|_1}$$
and obtain $\{\widehat{\beta}^{(t)}\}_{t=1}^T$.
- $q_\lambda(t) = \lambda^2 q(t/\lambda)$, where $q : [0, \infty) \to [0, \infty)$ is concave, increasing, and $q(0) = 0$.
SCAD: $q'(t) = \min\big\{1, \big(1 - \frac{t - 1}{a - 1}\big)_+\big\}$;
MCP: $q'(t) = (1 - t/a)_+$;
Capped $\ell_1$: $q'(t) = \mathbb{1}(t \le a/2)$.
Here $a > 2$ is a constant, say $a = 3.7$ (see the sketch below).
- We apply an Iterative Local Adaptive Majorize-Minimize (ILAMM) algorithm, a proximal gradient descent method, to compute the weighted $\ell_1$-penalized conquer.
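A small numpy sketch (an illustration, not the authors' code) of the three penalty derivatives and the per-coordinate weights $q_\lambda'(|\widehat{\beta}^{(t-1)}_j|) = \lambda\, q'(|\widehat{\beta}^{(t-1)}_j|/\lambda)$ used in the reweighted $\ell_1$ step; the default constants are placeholders:

```python
import numpy as np

def scad_deriv(t, a=3.7):
    """SCAD: q'(t) = min{1, (1 - (t - 1)/(a - 1))_+}."""
    return np.minimum(1.0, np.clip(1.0 - (t - 1.0) / (a - 1.0), 0.0, None))

def mcp_deriv(t, a=3.0):
    """MCP: q'(t) = (1 - t/a)_+."""
    return np.clip(1.0 - t / a, 0.0, None)

def capped_l1_deriv(t, a=3.0):
    """Capped l1: q'(t) = 1(t <= a/2)."""
    return (np.asarray(t) <= a / 2.0).astype(float)

def penalty_weights(beta_prev, lam, deriv=scad_deriv):
    """Weights q'_lambda(|beta_j|) = lam * q'(|beta_j|/lam), from q_lambda(t) = lam^2 q(t/lam)."""
    return lam * deriv(np.abs(beta_prev) / lam)
```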
Given the previous iterate $\beta^{(\ell-1)}$, define an isotropic quadratic objective function
$$F(\beta; \phi, \beta^{(\ell-1)}) = \widehat{Q}_h(\beta^{(\ell-1)}) + \langle \nabla \widehat{Q}_h(\beta^{(\ell-1)}),\, \beta - \beta^{(\ell-1)} \rangle + \frac{\phi}{2} \|\beta - \beta^{(\ell-1)}\|_2^2.$$
Minimizing $F(\beta; \phi, \beta^{(\ell-1)}) + \|\lambda \circ \beta\|_1$ yields
$$\beta^{(\ell)} = S_{\mathrm{soft}}\big(\beta^{(\ell-1)} - \nabla \widehat{Q}_h(\beta^{(\ell-1)})/\phi,\; \lambda/\phi\big),$$
where $S_{\mathrm{soft}}(x, c) = \mathrm{sign}(x) \max(|x| - c, 0)$ is the soft-thresholding operator.
The quadratic coefficient $\phi$ is chosen such that $F(\beta^{(\ell)}; \phi, \beta^{(\ell-1)}) \ge \widehat{Q}_h(\beta^{(\ell)})$.
Starting at a small value, say $\phi = 0.01$, iteratively increase $\phi$ by a factor of $\gamma = 1.25$ until the majorization property is met.
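A minimal numpy sketch of one such LAMM update (illustrative, under the Gaussian-kernel loss assumed earlier; lam_vec is a hypothetical per-coordinate penalty vector, e.g. zero for the unpenalized intercept):

```python
import numpy as np
from scipy.stats import norm

def soft_threshold(x, c):
    """Elementwise soft-thresholding S_soft(x, c) = sign(x) * max(|x| - c, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - c, 0.0)

def lamm_step(beta_prev, X, y, tau, h, lam_vec, phi=0.01, gamma=1.25):
    """One ILAMM update: inflate phi until the isotropic quadratic majorizes the loss."""
    n = len(y)
    def loss(b):
        r = y - X @ b
        return np.mean(r * (tau - norm.cdf(-r / h)) + h * norm.pdf(r / h))
    def grad(b):
        r = y - X @ b
        return X.T @ (norm.cdf(-r / h) - tau) / n
    g, f0 = grad(beta_prev), loss(beta_prev)
    while True:
        beta_new = soft_threshold(beta_prev - g / phi, lam_vec / phi)
        diff = beta_new - beta_prev
        majorant = f0 + g @ diff + 0.5 * phi * (diff @ diff)
        if majorant >= loss(beta_new):      # majorization property met
            return beta_new, phi
        phi *= gamma                        # otherwise increase phi and retry
```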
Thank you for your attention.
Software
R: https://cran.r-project.org/web/packages/conquer/
Python: https://github.com/WenxinZhou/Conquer
Manuscripts
He, X., Pan, X., Tan, K. M. & Zhou, W.-X. (2020). Smoothed quantile regression with large-scale inference. arXiv:2012.05187.
Tan, K. M., Wang, L. & Zhou, W.-X. (2020). High-dimensional quantile regression: convolution smoothing and concave regularization. Preprint.