conquer: Convolution Smoothed Quantile Regression
Wenxin Zhou
Joint work with Xuming He, Lan Wang, Kean Ming Tan & Xiaoou Pan
2021 WNAR Conference
‣ The idea of median regression predates the least squares method by about 50 years.
‣ Roger Joseph Boscovich (1760), Pierre-Simon Laplace (1789), F. Y. Edgeworth (1887).
‣ The method of least squares (the $\ell_2$-method) has significant computational advantages over the $\ell_1$-method (minimization of absolute errors) advocated by Boscovich, Laplace and others, even though the latter is more robust.
‣ Regression quantiles: Koenker and Bassett, Jr. (1978).
Computational Development
‣ Simplex-based algorithms: Barrodale & Roberts (1974), Koenker & d'Orey (1987). Slow in larger samples; worst-case analysis suggests the number of iterations may increase exponentially with the sample size.
‣ Interior point method: the Newton-Frisch algorithm (Portnoy & Koenker, 1997) has computational complexity $\mathcal{O}(n^{1+a} p^3 \log n)$ for $0 < a < 1/2$, conjectured to be improvable to $\mathcal{O}(n p^3 \log^2(n))$ (Mizuno-Todd-Ye conjecture).
‣ Preprocessing (Portnoy & Koenker, 1997): improved complexity $\mathcal{O}((np)^{2(1+a)/3} p^3 \log n + np)$, improvable to $\mathcal{O}(n^{2/3} p^3 \log^2(n) + np)$.
‣ R "quantreg", MATLAB, SAS, etc.
Quantile regression versus mean regression
‣ Mean regression: $y_i = m(x_i) + \varepsilon_i$, $\mathbb{E}(\varepsilon_i \mid x_i) = 0$.
‣ Quantile regression: $\mathbb{P}\{y \le Q(\tau, x) \mid x\} = \tau$.
No strict distinction between 'signal' and 'noise'. Object of interest: the conditional distribution of $y \mid x$. Contains richer information than the conditional mean.
Model and Assumptions
‣ Goal: learn the effect of a $p \times 1$ vector of covariates $x = (x_1, \ldots, x_p)^\top$ (with $x_1 \equiv 1$) on the entire distribution of $y$.
‣ Conditional quantile model: $Q(\tau, x) = F_{y|x}^{-1}(\tau) \approx x^\top \beta^*(\tau)$, $\tau \in [\tau_L, \tau_U] \subseteq (0,1)$.
‣ $(y_1, x_1), \ldots, (y_n, x_n) \stackrel{\text{iid}}{\sim} (y, x)$.
Given $\tau \in [\tau_L, \tau_U]$, $(y, x)$ admits the characterization $y \approx x^\top \beta^*(\tau) + \varepsilon(\tau)$, where $\mathbb{P}\{\varepsilon(\tau) \le 0 \mid x\} = \tau$. Let $f_{\varepsilon(\tau)}(\cdot \mid x)$ be the conditional density function of $\varepsilon(\tau)$ given $x$.
QR-series Approximation for NP Model
‣ QR-series approximation to $x \mapsto Q(\tau, x)$: fix $\tau$, let $x \mapsto Z(x) = (Z_1(x), \ldots, Z_m(x))^\top$ be a vector of series approximating functions of dimension $m = m_n$.
• B-splines (or regression splines)
• Polynomials
• Fourier series
• Compactly supported wavelets
‣ QR-series approximation error $R(\tau, x) = Q(\tau, x) - Z(x)^\top \beta^*(\tau)$ vanishes asymptotically under appropriate conditions when $m = m_n \to \infty$ as $n \to \infty$ (Belloni et al., 2019).
Quantile Regression
Given $\tau \in (0,1)$, the standard QR estimator is
$$\widehat{\beta} = \widehat{\beta}(\tau) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \rho_\tau(y_i - x_i^\top \beta),$$
where $\rho_\tau(u) = \{\tau - \mathbb{1}(u < 0)\}\, u$ is the check function.
• robustness against outliers in the response, especially in the case of median regression
• ability to capture heterogeneity in the set of important predictors at different quantile levels of the response distribution caused by, e.g., heteroscedastic variance
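For concreteness, a minimal numpy sketch (not part of the slides) of the check function and the resulting QR objective; minimizing it requires a linear-programming or other specialized solver, which is not shown here.

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function rho_tau(u) = {tau - 1(u < 0)} * u."""
    return (tau - (u < 0)) * u

def qr_objective(beta, X, y, tau):
    """Standard QR objective sum_i rho_tau(y_i - x_i' beta) for a candidate beta."""
    return np.sum(check_loss(y - X @ beta, tau))
```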
Theoretical Development
‣ Consistency: Bassett & Koenker (1986), Zhao, Rao & Chen (1993), El Bantli & Hallin (1999), etc.
‣ Rate of convergence: Ruppert & Carroll (1980), Pollard (1991), Hjort & Pollard (1993), Knight (1998), etc.
‣ Bahadur representation & Normal approximation: Jureckova & Sen (1984), Portnoy (1984), Koenker & Portnoy (1987), Portnoy & Koenker (1989), Gutenbrunner & Jureckova (1992), Hendricks & Koenker (1991), He & Shao (1996, 2000), Arcones (1996), Koenker & Machado (1999), Koenker & Xiao (2002).
Classical asymptotics: $n \to \infty$ and $p$ is fixed.
Challenges in QR
• Lack of strong convexity: the quantile loss is piecewise linear and its "curvature energy" is concentrated at a single point. This is substantially different from other popular loss functions, e.g. $\ell_2$, logistic and Huber, or even Tukey and Hampel, which are at least locally strongly convex.
• Lack of smoothness: the quantile loss is not everywhere differentiable. Theoretically, this leads to an error term of order $\mathcal{O}_{\mathbb{P}}(n^{-1/4})$ or $\mathcal{O}_{\mathbb{P}}\{(p^3/n)^{1/4}\}$ in the Bahadur representation.
‣ Welsh (1989) shows that $p^3 \log^2(n) = o(n)$ suffices for normal approximation (fixed design).
‣ Huber regression requires $p^2 = o(n)$.
Smoothed Estimating Equation (SEE) Approach
- Population loss $Q(\beta) = \mathbb{E}_{(y,x) \sim P}\, \rho_\tau(y - x^\top \beta)$, and $\beta^* = \beta^*(\tau) = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} Q(\beta)$.
- If $F_{\varepsilon(\tau)|x}$ is continuously differentiable, $Q$ is twice differentiable and strongly convex at least in a neighborhood of $\beta^*$. Moreover, $\beta^*$ satisfies the first-order condition:
$$\nabla Q(\beta)\big|_{\beta = \beta^*} = \mathbb{E}\big[\{\mathbb{1}(y < x^\top \beta) - \tau\}\, x\big]\big|_{\beta = \beta^*} = 0.$$
- Sample analog:
$$\frac{1}{n} \sum_{i=1}^n \{\mathbb{1}(y_i - x_i^\top \beta < 0) - \tau\}\, x_i = 0.$$
The QR estimator $\widehat{\beta}$ solves this equation approximately.
- SEE approach (Whang, 2006; Kaplan & Sun, 2017):
$$\frac{1}{n} \sum_{i=1}^n \{G(-r_i(\beta)/h) - \tau\}\, x_i = 0,$$
where $r_i(\beta) = y_i - x_i^\top \beta$, $G$ is a smooth function and $h > 0$ is the bandwidth.
- Horowitz's method (Horowitz, 1998): smooth the criterion function by replacing the indicator in the check function with a kernel counterpart:
$$\ell_h^H(u) = u\{\tau - G(-u/h)\}.$$
- Horowitz's smoothed check function is not convex!
M-estimation Viewpoint
- Smoothed loss function
$$\widehat{Q}_h(\beta) = \frac{1}{n} \sum_{i=1}^n \underbrace{(\rho_\tau * K_h)}_{=:\, \ell_h}(y_i - x_i^\top \beta),$$
where $K$ is a kernel function, $K_h(u) = (1/h) K(u/h)$, and
$$\ell_h(u) = (\rho_\tau * K_h)(u) = \int_{-\infty}^{\infty} \rho_\tau(v) K_h(v - u)\, dv.$$
- Convolution smoothed QR (conquer):
$$\widehat{\beta}_h = \widehat{\beta}_h(\tau) \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \widehat{Q}_h(\beta).$$
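For intuition, here is a minimal numpy sketch (an illustration, not the authors' implementation) of the smoothed loss under a Gaussian kernel, for which the convolution has the closed form $\ell_h(u) = u\{\tau - \Phi(-u/h)\} + h\,\phi(u/h)$, with $\Phi$ and $\phi$ the standard normal CDF and density.

```python
import numpy as np
from scipy.stats import norm

def conquer_loss_gaussian(beta, X, y, tau, h):
    """Empirical conquer loss (1/n) sum_i ell_h(y_i - x_i' beta) with a Gaussian kernel.

    Uses the closed form ell_h(u) = u * (tau - Phi(-u/h)) + h * phi(u/h), obtained by
    integrating the check function against a N(0, h^2) density.
    """
    r = y - X @ beta                                  # residuals r_i(beta)
    return np.mean(r * (tau - norm.cdf(-r / h)) + h * norm.pdf(r / h))
```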
- Any optimum $\widehat{\beta}_h$ satisfies the FOC
$$\nabla \widehat{Q}_h(\widehat{\beta}_h) = \frac{1}{n} \sum_{i=1}^n \{\bar{K}(-r_i(\widehat{\beta}_h)/h) - \tau\}\, x_i = 0,$$
where $\bar{K}(u) = \int_{-\infty}^{u} K(t)\, dt$.
- Convexity:
$$\nabla^2 \widehat{Q}_h(\beta) = \frac{1}{n} \sum_{i=1}^n K_h(y_i - x_i^\top \beta) \cdot x_i x_i^\top.$$
Provided that $K$ is non-negative, $\widehat{Q}_h$ is convex and hence any minimizer satisfies the first-order moment condition.
- Fixed-$p$ asymptotics: Fernandes, Guerre & Horta (2021).
- Growing-$p$ (non)asymptotics: He et al. (2020).
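Under the same Gaussian-kernel assumption as above ($\bar{K} = \Phi$, $K_h(u) = \phi(u/h)/h$), the gradient and Hessian take a simple form; a short numpy sketch:

```python
import numpy as np
from scipy.stats import norm

def conquer_grad_gaussian(beta, X, y, tau, h):
    """Gradient (1/n) sum_i {Kbar(-r_i/h) - tau} x_i with Kbar = Phi."""
    r = y - X @ beta
    return X.T @ (norm.cdf(-r / h) - tau) / len(y)

def conquer_hess_gaussian(beta, X, y, tau, h):
    """Hessian (1/n) sum_i K_h(r_i) x_i x_i^T; K_h >= 0 makes the loss convex."""
    r = y - X @ beta
    w = norm.pdf(r / h) / h
    return (X * w[:, None]).T @ X / len(y)
```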
Convolution versus Deconvolution
Adding noise $\{u_i\}$ to the response leads to noisy QR
$$\min_\beta \frac{1}{n} \sum_{i=1}^n \rho_\tau(y_i + u_i - x_i^\top \beta).$$
Since the noise distribution can be specified, consider
$$\min_\beta \frac{1}{n} \sum_{i=1}^n \mathbb{E}_u\{\rho_\tau(y_i + u_i - x_i^\top \beta)\}.$$
‣ Gaussian kernel/noise: $u_i \sim N(0, h^2)$.
‣ Uniform kernel/noise: $u_i \sim \mathrm{Unif}(-h, h)$.
‣ Laplacian kernel/noise: $u_i \sim \mathrm{Laplace}(0, h)$.
Computational Methods
‣ Gradient descent (GD): starting at iteration 0 with an initial estimate $\widehat{\beta}^{0}$, at iteration $t = 0, 1, 2, \ldots$, computes
$$\widehat{\beta}^{t+1} = \widehat{\beta}^{t} - \eta_t \cdot \nabla \widehat{Q}_h(\widehat{\beta}^{t}),$$
where $\nabla \widehat{Q}_h(\beta) = (1/n) \sum_i \{\bar{K}((x_i^\top \beta - y_i)/h) - \tau\}\, x_i$.
‣ Barzilai-Borwein step size (Barzilai & Borwein, 1988): for $t = 1, 2, \ldots,$ define
$$\delta_t = \widehat{\beta}^{t} - \widehat{\beta}^{t-1}, \qquad g_t = \nabla \widehat{Q}_h(\widehat{\beta}^{t}) - \nabla \widehat{Q}_h(\widehat{\beta}^{t-1}).$$
The BB step sizes are $\eta_{1,t} = \langle \delta_t, \delta_t \rangle / \langle \delta_t, g_t \rangle$ and $\eta_{2,t} = \langle \delta_t, g_t \rangle / \langle g_t, g_t \rangle$.
‣ As $\tau \approx 0$ or $1$, the Hessian
$$\nabla^2 \widehat{Q}_h(\beta) = \frac{1}{n} \sum_{i=1}^n K_h(y_i - x_i^\top \beta) \cdot x_i x_i^\top$$
becomes more ill-conditioned. The step sizes computed in GD-BB may sometimes fluctuate drastically, causing instability of the algorithm. To stabilize the algorithm, we take
$$\eta_t = \min\{\eta_{1,t}, \eta_{2,t}, C\}, \qquad t = 1, 2, \ldots,$$
for example, $C = 10$ (a sketch of the full GD-BB loop follows below).
‣ Scale the covariate inputs to have zero mean and/or unit variance before applying the GD-BB method.
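A minimal numpy sketch of the GD-BB loop with capped step sizes (an illustration, not the conquer package's implementation; the default bandwidth, initialization and stopping rule below are placeholder assumptions):

```python
import numpy as np
from scipy.stats import norm

def conquer_gd_bb(X, y, tau=0.5, h=None, beta0=None, max_iter=500, tol=1e-8, C=10.0):
    """Gradient descent with capped Barzilai-Borwein step sizes for Gaussian-kernel conquer."""
    n, p = X.shape
    if h is None:
        h = max(0.05, ((p + np.log(n)) / n) ** 0.4)     # heuristic bandwidth (assumption)
    def grad(beta):
        r = y - X @ beta
        return X.T @ (norm.cdf(-r / h) - tau) / n
    beta_old = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float)
    g_old = grad(beta_old)
    beta = beta_old - g_old                             # first step: unit step size
    for _ in range(max_iter):
        g = grad(beta)
        delta, gdiff = beta - beta_old, g - g_old
        denom1, denom2 = delta @ gdiff, gdiff @ gdiff
        eta1 = delta @ delta / denom1 if denom1 > 0 else C
        eta2 = denom1 / denom2 if denom2 > 0 else C
        eta = min(eta1, eta2, C)                        # capped BB step size
        beta_old, g_old = beta, g
        beta = beta - eta * g
        if np.linalg.norm(beta - beta_old) < tol:
            break
    return beta
```

Per the slides, standardizing the columns of X beforehand helps keep the Hessian well-conditioned.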
Initialization via Expectile Regression
‣ Asymmetric quadratic loss (Newey & Powell, 1987):
$$e_\tau(u) = |\tau - \mathbb{1}(u < 0)| \cdot u^2 / 2.$$
‣ Given a univariate random variable $Z$ with $\mathbb{E}|Z| < \infty$, the parameter
$$e_\tau = \operatorname*{arg\,min}_{u \in \mathbb{R}} \mathbb{E}\{e_\tau(Z - u) - e_\tau(Z)\}$$
is called the $\tau$-expectile (Newey & Powell, 1987) or Efron's $\omega$-mean with $\omega = \tau/(1 - \tau)$ (Efron, 1991).
Robustified Expectile Regression (retire)
‣ Robustified asymmetric quadratic loss:
$$r_c(u) = |\tau - \mathbb{1}(u < 0)| \cdot H_c(u),$$
where $H_c$ ($c > 0$) is the Huber loss
$$H_c(u) = 0.5 u^2 \mathbb{1}(|u| \le c) + (c|u| - c^2/2)\, \mathbb{1}(|u| > c).$$
‣ We use the retire estimator as an initial estimate:
$$\widehat{\beta}^{0}_{c} \in \operatorname*{arg\,min}_{\beta} \frac{1}{n} \sum_{i=1}^n r_c(y_i - x_i^\top \beta).$$
‣ When $\tau = 1/2$, this becomes Huber's $M$-estimator.
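A short numpy sketch of this robustified loss (an illustration; the Huber constant c = 1.345 below is a conventional placeholder, not the authors' tuning rule):

```python
import numpy as np

def retire_loss(u, tau=0.5, c=1.345):
    """Robustified asymmetric quadratic loss r_c(u) = |tau - 1(u < 0)| * H_c(u)."""
    huber = np.where(np.abs(u) <= c, 0.5 * u**2, c * np.abs(u) - 0.5 * c**2)
    return np.abs(tau - (u < 0)) * huber
```

Minimizing np.mean(retire_loss(y - X @ beta, tau, c)) over beta gives the initial estimate used to warm-start conquer.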
Bootstrap Inference with Conquer
Let $\{w_i\}_{i=1}^n$ be iid with $\mathbb{E}(w_i) = 1$ and $\mathrm{Var}(w_i) = 1$. The bootstrapped conquer estimator is defined as
$$\widehat{\beta}^{\flat} \in \operatorname*{arg\,min}_{\beta} \widehat{Q}^{\flat}_h(\beta), \qquad \widehat{Q}^{\flat}_h(\beta) = \frac{1}{n} \sum_{i=1}^n w_i\, \ell_h(y_i - x_i^\top \beta).$$
(i) $w_i \sim \mathcal{N}(1, 1)$ (negative weights break the convexity)
(ii) $w_i \sim \mathrm{Exp}(1)$
(iii) $w_i \sim 2\,\mathrm{Ber}(0.5)$ (recommended for large samples)
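A minimal sketch of the three multiplier-weight schemes and the resulting bootstrap objective (illustrative only, with the Gaussian-kernel loss assumed earlier):

```python
import numpy as np
from scipy.stats import norm

def bootstrap_weights(n, scheme="exp", rng=None):
    """Multiplier weights with mean 1 and variance 1."""
    rng = np.random.default_rng() if rng is None else rng
    if scheme == "normal":       # (i) N(1, 1); negative weights can break convexity
        return rng.normal(1.0, 1.0, n)
    if scheme == "exp":          # (ii) Exp(1)
        return rng.exponential(1.0, n)
    if scheme == "rademacher":   # (iii) 2 * Bernoulli(0.5), i.e. weights in {0, 2}
        return 2.0 * rng.binomial(1, 0.5, n)
    raise ValueError(f"unknown scheme: {scheme}")

def bootstrap_conquer_loss(beta, X, y, w, tau, h):
    """Bootstrapped objective (1/n) sum_i w_i * ell_h(y_i - x_i' beta), Gaussian kernel."""
    r = y - X @ beta
    ell = r * (tau - norm.cdf(-r / h)) + h * norm.pdf(r / h)
    return np.mean(w * ell)
```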
Normal Distribution Calibrated Inference
‣ As shown in He et al. (2020), for each $j = 1, \ldots, p$,
$$n^{1/2} \sigma_j^{-1} (\widehat{\beta}_h - \beta^*)_j \xrightarrow{d} \mathcal{N}(0, 1),$$
where $\sigma_j^2$ is the $j$-th diagonal entry of
$$H^{-1}\, \mathbb{E}\big[\{\bar{K}(-\varepsilon/h) - \tau\}^2 x x^\top\big]\, H^{-1}$$
and $H = \mathbb{E}\{f_\varepsilon(0 \mid x)\, x x^\top\}$.
‣ Kernel-type matrix estimators:
$$\widehat{H} = \frac{1}{nh} \sum_{i=1}^n \phi(\widehat{\varepsilon}_i / h)\, x_i x_i^\top, \qquad \widehat{\Sigma}(\tau) = \frac{1}{n} \sum_{i=1}^n \{\bar{K}(-\widehat{\varepsilon}_i / h) - \tau\}^2 x_i x_i^\top.$$
Here we use the same bandwidth $h$ as for the estimator $\widehat{\beta}_h$.
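A plug-in sketch of the resulting standard errors under the Gaussian-kernel assumption used in the earlier code (illustrative; not the package's estimator):

```python
import numpy as np
from scipy.stats import norm

def conquer_std_errors(X, y, beta_hat, tau, h):
    """Sandwich standard errors: sqrt of diag(H^-1 Sigma H^-1 / n) with fitted residuals."""
    n = len(y)
    res = y - X @ beta_hat                                   # residuals at beta_hat
    H_hat = (X * (norm.pdf(res / h) / h)[:, None]).T @ X / n
    score = (norm.cdf(-res / h) - tau)[:, None] * X
    Sigma_hat = score.T @ score / n
    H_inv = np.linalg.inv(H_hat)
    cov = H_inv @ Sigma_hat @ H_inv / n                      # approximate Cov(beta_hat)
    return np.sqrt(np.diag(cov))
```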
Penalized Conquer in High Dimensions
$\beta^* \in \mathbb{R}^p$ is sparse: $\|\beta^*\|_0 \le s \ll n \ll p$.
- $\ell_1$-penalized QR (Belloni & Chernozhukov, 2011):
$$\min_{\beta \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n \rho_\tau(y_i - x_i^\top \beta) + \lambda \|\beta\|_1,$$
where $\lambda \asymp \sqrt{\tau(1-\tau) \cdot \log(p)/n}$.
- R packages: "quantreg", "FHDQR", "rqPen", etc.
• $\ell_1$ penalty: introduces non-negligible estimation bias.
Concave Regularization
Smoothly Clipped Absolute Deviation: Fan & Li (2001)
Minimax Concave Penalty: C.-H. Zhang (2010)
Iteratively Reweighted $\ell_1$-conquer
Starting with an initial estimate $\widehat{\beta}^{(0)} = (\widehat{\beta}^{(0)}_1, \ldots, \widehat{\beta}^{(0)}_p)^\top$, at iteration $t = 1, \ldots, T$, solve the convex optimization problem
$$\min_\beta\; \underbrace{\frac{1}{n} \sum_{i=1}^n \ell_h(y_i - x_i^\top \beta)}_{\widehat{Q}_h(\beta)} \;+\; \underbrace{\sum_{j=2}^p q_\lambda'\big(|\widehat{\beta}^{(t-1)}_j|\big)\, |\beta_j|}_{\|\lambda^{(t-1)} \circ\, \beta_{-}\|_1}$$
and obtain $\{\widehat{\beta}^{(t)}\}_{t=1}^T$.
- $q_\lambda(t) = \lambda^2 q(t/\lambda)$, where $q : [0, \infty) \to [0, \infty)$ is concave, increasing, and $q(0) = 0$.
SCAD: $q'(t) = \min\big\{1, \big(1 - \frac{t - 1}{a - 1}\big)_+\big\}$;
MCP: $q'(t) = (1 - t/a)_+$;
Capped $\ell_1$: $q'(t) = \mathbb{1}(t \le a/2)$.
Here $a > 2$ is a constant, say $a = 3.7$ (see the sketch below).
- We apply an Iterative Local Adaptive Majorize-Minimize (ILAMM) algorithm, a proximal gradient descent method, to compute the weighted $\ell_1$-penalized conquer.
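A small numpy sketch (an illustration, not the authors' code) of the three penalty derivatives and the per-coordinate weights $q_\lambda'(|\widehat{\beta}^{(t-1)}_j|) = \lambda\, q'(|\widehat{\beta}^{(t-1)}_j|/\lambda)$ used in the reweighted $\ell_1$ step; the default constants are placeholders:

```python
import numpy as np

def scad_deriv(t, a=3.7):
    """SCAD: q'(t) = min{1, (1 - (t - 1)/(a - 1))_+}."""
    return np.minimum(1.0, np.clip(1.0 - (t - 1.0) / (a - 1.0), 0.0, None))

def mcp_deriv(t, a=3.0):
    """MCP: q'(t) = (1 - t/a)_+."""
    return np.clip(1.0 - t / a, 0.0, None)

def capped_l1_deriv(t, a=3.0):
    """Capped l1: q'(t) = 1(t <= a/2)."""
    return (np.asarray(t) <= a / 2.0).astype(float)

def penalty_weights(beta_prev, lam, deriv=scad_deriv):
    """Weights q'_lambda(|beta_j|) = lam * q'(|beta_j|/lam), from q_lambda(t) = lam^2 q(t/lam)."""
    return lam * deriv(np.abs(beta_prev) / lam)
```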
Given the previous iterate $\beta^{(\ell-1)}$, define an isotropic quadratic objective function
$$F(\beta; \phi, \beta^{(\ell-1)}) = \widehat{Q}_h(\beta^{(\ell-1)}) + \langle \nabla \widehat{Q}_h(\beta^{(\ell-1)}),\, \beta - \beta^{(\ell-1)} \rangle + \frac{\phi}{2} \|\beta - \beta^{(\ell-1)}\|_2^2.$$
Minimizing $F(\beta; \phi, \beta^{(\ell-1)}) + \|\lambda \circ \beta\|_1$ yields
$$\beta^{(\ell)} = S_{\mathrm{soft}}\big(\beta^{(\ell-1)} - \nabla \widehat{Q}_h(\beta^{(\ell-1)})/\phi,\; \lambda/\phi\big),$$
where $S_{\mathrm{soft}}(x, c) = \mathrm{sign}(x) \max(|x| - c, 0)$ is the soft-thresholding operator.
The quadratic coefficient $\phi$ is chosen such that $F(\beta^{(\ell)}; \phi, \beta^{(\ell-1)}) \ge \widehat{Q}_h(\beta^{(\ell)})$.
Starting at a small value, say $\phi = 0.01$, iteratively increase $\phi$ by a factor of $\gamma = 1.25$ until the majorization property is met.
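A minimal numpy sketch of one such LAMM update (illustrative, under the Gaussian-kernel loss assumed earlier; lam_vec is a hypothetical per-coordinate penalty vector, e.g. zero for the unpenalized intercept):

```python
import numpy as np
from scipy.stats import norm

def soft_threshold(x, c):
    """Elementwise soft-thresholding S_soft(x, c) = sign(x) * max(|x| - c, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - c, 0.0)

def lamm_step(beta_prev, X, y, tau, h, lam_vec, phi=0.01, gamma=1.25):
    """One ILAMM update: inflate phi until the isotropic quadratic majorizes the loss."""
    n = len(y)
    def loss(b):
        r = y - X @ b
        return np.mean(r * (tau - norm.cdf(-r / h)) + h * norm.pdf(r / h))
    def grad(b):
        r = y - X @ b
        return X.T @ (norm.cdf(-r / h) - tau) / n
    g, f0 = grad(beta_prev), loss(beta_prev)
    while True:
        beta_new = soft_threshold(beta_prev - g / phi, lam_vec / phi)
        diff = beta_new - beta_prev
        majorant = f0 + g @ diff + 0.5 * phi * (diff @ diff)
        if majorant >= loss(beta_new):      # majorization property met
            return beta_new, phi
        phi *= gamma                        # otherwise increase phi and retry
```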
Thank you for your attention.
Software
R: https://cran.r-project.org/web/packages/conquer/
Python: https://github.com/WenxinZhou/Conquer
Manuscripts
He, X., Pan, X., Tan, K. M. & Zhou, W.-X. (2020). Smoothed quantile regression with large-scale inference. arXiv:2012.05187.
Tan, K. M., Wang, L. & Zhou, W.-X. (2020). High-dimensional quantile regression: convolution smoothing and concave regularization. Preprint.