
Convex Optimization Methods for Quantile Regression

Purvasha Chakravarti
Department of Statistics & Data Science
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Collin Eubanks
Department of Statistics & Data Science
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Jining Qin
Department of Statistics & Data Science
Machine Learning Department
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]

Abstract

The fundamental idea of statistical regression is to use a set of noisy data points (x_1, y_1), ..., (x_n, y_n) to construct an estimate of some underlying functional r of the distribution of a response Y, conditioned on a set of features X. The ordinary least squares (OLS) estimator, for example, estimates the conditional expectation of Y given X. The conditional expectation is most often the quantity of interest, but in some problems we are more interested in the conditional quantiles. Koenker and Bassett (1978) showed that a simple minimization problem yields the ordinary sample quantiles, and that this formulation generalizes naturally to the linear model, producing a new class of statistics called "regression quantiles" that outperform the least-squares estimator over a wide class of non-Gaussian error distributions (1), (2).

1 Introduction

Quantile regression (QR) may be viewed as an extension of the classical regression framework. While the latter estimates the conditional expectation of a response variable Y given a set of covariates X, QR instead estimates conditional quantile functions.

For example, the least squares regression problem can be written as

argmin_{r ∈ G} R(Y, r(X))    (1)

where

R(Y, r(X)) = (1/n) ∑_{i=1}^n (y_i − r(x_i))^2


is the risk function and G is some class of functions. Quantile regression, on the other hand, utilizes the following risk function:

R(Y, r(X)) = (1/n) ∑_{i=1}^n ρ_τ(y_i − r(x_i)),    (2)

where the "tilting" function ρ_τ, 0 < τ < 1, is defined as

ρ_τ(x) = τ·x if x ≥ 0, and (τ − 1)·x if x < 0.    (3)
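For concreteness, here is a minimal NumPy sketch of the tilting (check) loss; the function name rho_tau and the toy values are ours, not from any package:

import numpy as np

def rho_tau(tau, x):
    # Check (tilting) loss: tau * x on the nonnegative side, (tau - 1) * x on the negative side.
    return np.where(x >= 0, tau * x, (tau - 1) * x)

print(rho_tau(0.5, np.array([-2.0, 3.0])))  # [1.  1.5]; tau = 0.5 gives half the absolute loss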

For example, when analyzing salaries over a population we are often more interested in the median and other quantiles than in the mean, since the mean is easily skewed by outliers (think Bill Gates, Mark Zuckerberg, etc.). The salaries of 459 statistics professors in U.S. colleges and universities are shown in Figure 1 as a function of their years of experience. Quantile regression is well suited to such data.

Figure 1: Salaries (in 1991) for 459 statistics professors in U.S. colleges and universities vs. years of experience.

The solution of this optimization is the conditional τ-quantile function. By letting τ = 0.5, we get back the conditional median function. Figure 2 illustrates the idea behind QR and contrasts it with OLS.

Figure 2: Quantile regression vs. OLS regression.

1.1 Nonparametric Quantile Regression

Various methods of nonparametric estimation of conditional quantiles have been proposed. Stone (1977) and Truong (1989) utilized a nearest-neighbors approach; Samanta (1989) and Antoch and Janssen (1989) used kernel methods; and White (1990) used neural networks. Cox and Jones (1988) proposed smoothing splines, which include a penalization term for smoothness. A variant of this is considered by Koenker et al. (1994), using a different regularization term (3). Takeuchi et al. (2006) presented another nonparametric quantile regression method, which uses an RKHS norm as the regularization term and finds a function in the kernel space to estimate r (4).

In our project we formulate linear quantile regression and two nonparametric quantile regression methods as convex optimization problems and discuss the different methods that can be used to solve them. In general we consider problems of the form

argmin_{r ∈ G} (1/n) ∑_{i=1}^n ρ_τ(y_i − r(x_i)) + λ·h(r),    (4)

where G is some class of continuous functions and h(r) is a regularization term.

We review the theoretical properties of the different methods. We show that linear quantile regression and quantile smoothing splines with the ℓ1 penalty can be reformulated as linear programs (LPs), each of which can easily be solved using the simplex algorithm. The quantile smoothing spline with the ℓ2 norm,

argmin_{g ∈ G} (1/n) ∑_{i=1}^n ρ_τ(y_i − g(x_i)) + λ ∫_Ω |g''(x)|^2 dx,    (5)

is a quadratic program (QP) and creates some significant computational problems (3). Bosch, Ye & Woodworth (1995) discuss an interior point algorithm for solving this problem (5). The quantile regression regularized by the RKHS norm is also shown to be a QP but, unlike the smoothing splines, the dual of this problem can be implemented very efficiently.

To compare these methods, we apply them to a real-world data set. In particular, we test them on a data set of galaxy morphological statistics. Cosmological data often do not come with clear linear structure. The conditional distributions are often skewed, which warrants the use of nonparametric quantile regression.

2 Quantile Regression Methods: The Old and the New

2.1 A Different Perspective for Defining Quantiles

The τ-th quantile is the inverse of the cumulative distribution function at τ:

Q_Y(τ) = F^{−1}(τ) = inf{ y : F(y) ≥ τ }.

Quantiles can also be defined in a less common but very helpful way, as the minimizer of an asymmetrically weighted sum of absolute deviations. That is,

Q_Y(τ) = argmin_c E[ρ_τ(Y − c)],

where ρ_τ is the tilting function

ρ_τ(x) = τ·x if x ≥ 0, and (τ − 1)·x if x < 0.

Given a set of observed data values y_i, the empirical τ-th quantile is defined as

argmin_c (1/n) ∑_{i=1}^n ρ_τ(y_i − c).    (6)


This can be rewritten as the following linear program:

min_{u ∈ ℝ^n, v ∈ ℝ^n, c ∈ ℝ} ∑_{i=1}^n [τ·u_i + (1 − τ)·v_i]    (7)

subject to y_i = c + u_i − v_i, u_i ≥ 0, v_i ≥ 0, i = 1, ..., n.

This LP can be solved efficiently by the simplex algorithm developed by Dantzig (1947).
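As a sanity check, a minimal sketch of (7) using scipy.optimize.linprog (which uses the HiGHS solvers rather than a textbook simplex implementation); the data are synthetic:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
y = rng.normal(size=200)  # toy sample
tau, n = 0.75, 200

# Decision vector [u, v, c]: u, v >= 0, c free.
cost = np.concatenate([np.full(n, tau), np.full(n, 1 - tau), [0.0]])
A_eq = np.hstack([np.eye(n), -np.eye(n), np.ones((n, 1))])  # u_i - v_i + c = y_i
bounds = [(0, None)] * (2 * n) + [(None, None)]

res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
print(res.x[-1], np.quantile(y, tau))  # the LP minimizer matches the sample quantile closely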

2.2 Linear Quantile Regression

Linear quantile regression is the simplest case of (4). We assume the conditional quantiles of Y can be written as linear combinations of the features X and solve the following optimization problem:

argmin_{β ∈ ℝ^p} (1/n) ∑_{i=1}^n ρ_τ(y_i − x_i^T β).    (8)

Similarly, the problem can be rewritten as:

min_{u ∈ ℝ^n, v ∈ ℝ^n, β ∈ ℝ^p} ∑_{i=1}^n [τ·u_i + (1 − τ)·v_i]    (9)

subject to y_i = x_i^T β + u_i − v_i, u_i ≥ 0, v_i ≥ 0, i = 1, ..., n.

Again this is solvable using the simplex algorithm. Koenker and Ng (2004) proposed a Frisch-Newton algorithm for the sparse case under this setting.
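The LP (9) differs from (7) only in its equality constraints, which now involve the design matrix. A sketch in the same style, on synthetic data (the coefficients and noise model are illustrative):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p = 300, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept + one covariate
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)     # toy linear model
tau = 0.9

# Decision vector [u, v, beta]: u, v >= 0, beta free.
cost = np.concatenate([np.full(n, tau), np.full(n, 1 - tau), np.zeros(p)])
A_eq = np.hstack([np.eye(n), -np.eye(n), X])              # u_i - v_i + x_i' beta = y_i
bounds = [(0, None)] * (2 * n) + [(None, None)] * p

res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
beta_hat = res.x[-p:]  # estimated coefficients of the conditional 0.9-quantile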

2.3 Quantile Smoothing Splines

We now consider a much more flexible class of functions. Specifically, let G be the class of all twice-continuously-differentiable functions. We obtain a quantile smoothing spline (QSS) with the following optimization over this class:

argmin_{r ∈ G} (1/n) ∑_{i=1}^n ρ_τ(y_i − r(x_i)) + λ (∫_Ω |r''(x)|^p dx)^{1/p}.    (10)

The following theorem gives a somewhat surprising result that allows us to narrow our focus to a much smaller subset of G.

Theorem 2.1 When p = 1, the function r ∈ G minimizing (1/n) ∑_{i=1}^n ρ_τ(y_i − r(x_i)) + λ (∫_Ω |r''(x)|^p dx)^{1/p} is a linear spline with knots at the points x_i, i = 1, ..., n.

Therefore, in this setting r can be represented as the following linear combination of the truncated power basis functions of order k = 1 with knots at the observations x_i:

r = ∑_{i=1}^{n+k+1} β_i h_i,

where

h_1(x) = 1, h_2(x) = x, ..., h_{k+1}(x) = x^k,
h_{k+1+i}(x) = (x − x_i)_+^k, i = 1, ..., n.

See (3) for a proof of this result. (3) also shows that this problem can be reformulated as a linear program (LP).

Note: When p = 2, (10) can be shown to be a quadratic program (QP) whose solution is a piecewise quadratic polynomial. However, this is more computationally expensive, so we consider the p = 1 case; a sketch of the resulting LP follows.
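For the order-1 basis, r'' is a sum of point masses at the knots, so the penalty ∫_Ω |r''(x)| dx reduces to ∑_i |β_{2+i}|, an ℓ1 norm on the knot coefficients; splitting each coefficient into positive and negative parts then gives an LP. A sketch under those assumptions (the data and λ are illustrative):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
tau, lam = 0.5, 1e-2

# Order-1 truncated power basis: columns 1, x, (x - x_1)_+, ..., (x - x_n)_+.
H = np.column_stack([np.ones(n), x, np.maximum(x[:, None] - x[None, :], 0.0)])
m = n + 2

# Variables [u, v, s, t] >= 0 with beta = s - t; penalize only the knot coefficients.
pen = np.r_[0.0, 0.0, np.full(n, lam)]
cost = np.concatenate([np.full(n, tau / n), np.full(n, (1 - tau) / n), pen, pen])
A_eq = np.hstack([np.eye(n), -np.eye(n), H, -H])  # y = H beta + u - v

res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
beta_hat = res.x[2 * n:2 * n + m] - res.x[2 * n + m:]
fit = H @ beta_hat  # piecewise linear quantile fit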


2.4 Quantile Regression Penalized by an RKHS Norm

The RKHS (reproducing kernel Hilbert space) norm was used as a regularizer by Takeuchi et al. (2006) (4). The main purpose was to show its relation to support vector regression and to demonstrate its performance relative to the existing methods mentioned previously. The optimization problem is stated as follows:

argmin_{r = g + b} (1/n) ∑_{i=1}^n ρ_τ(y_i − r(x_i)) + (λ/2) ||g||_H^2,    (11)

where r = g + b with g ∈ H and b ∈ ℝ. Here || · ||_H is the RKHS norm; the offset b is not penalized. The r that minimizes the above problem gives the desired quantile regression estimator.

Now we show that this problem can be formulated as a QP. Using the connection between RKHSs and feature spaces, we write r(x) = ⟨φ(x), w⟩ + b for some w, where g = r − b ∈ H. We introduce auxiliary variables ξ_i ≥ 0 and ξ*_i ≥ 0 such that

y_i − ⟨φ(x_i), w⟩ − b ≤ ξ_i  and  ⟨φ(x_i), w⟩ + b − y_i ≤ ξ*_i.

Then the RKHS norm quantile regression problem is equivalent to solving the following minimization problem.

min_{w, b, ξ, ξ*} C ∑_{i=1}^n [τ·ξ_i + (1 − τ)·ξ*_i] + (1/2) ||w||^2    (12)

subject to y_i − ⟨φ(x_i), w⟩ − b ≤ ξ_i,
⟨φ(x_i), w⟩ + b − y_i ≤ ξ*_i,
ξ_i, ξ*_i ≥ 0.

Here C = 1λn . We also derive the dual of this problem as it is more useful for efficient implemenation.

The dual variables associated with the ξ_i and ξ*_i constraints can be combined into a single variable α_i. This yields the dual problem

min_α (1/2) α^T K α − α^T y    (13)

subject to C(τ − 1) ≤ α_i ≤ Cτ for all 1 ≤ i ≤ n,
1^T α = 0,

where K_ij = k(x_i, x_j) and k is the reproducing kernel corresponding to H. The primal solution can be recovered from the KKT conditions via

w = ∑_i α_i φ(x_i), i.e., r(x) = ∑_i α_i k(x_i, x) + b,

and r(x_i) = y_i whenever α_i lies strictly inside the interval (C(τ − 1), Cτ); any such point determines b.

A practical advantage of the dual is that it can be solved directly using any QP solver. We use this method with a Gaussian RBF (radial basis function) kernel with automatically chosen kernel width σ^2 and regularization parameter C.
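A minimal cvxpy sketch of the dual (13); the synthetic data, the kernel width, and the small jitter added to K so the quadratic form passes the solver's positive semidefiniteness check are our choices, not from (4):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 150
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)
tau, lam, sigma2 = 0.5, 0.05, 0.1
C = 1.0 / (lam * n)

# Gaussian RBF kernel matrix, with jitter for numerical positive semidefiniteness.
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma2)) + 1e-8 * np.eye(n)

alpha = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.quad_form(alpha, K) - alpha @ y)
constraints = [alpha >= C * (tau - 1), alpha <= C * tau, cp.sum(alpha) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
# Recover b from non-bound support vectors: r(x_i) = y_i for strictly interior alpha_i.
interior = (a > C * (tau - 1) + 1e-6) & (a < C * tau - 1e-6)
b = np.mean(y[interior] - K[interior] @ a)
r_hat = K @ a + b  # fitted conditional tau-quantile at the sample points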

3 Results

3.1 Comparing the methods on the Doppler function

To compare the methods, we generate a synthetic data set so that we can accurately assess how well each method approximates the underlying function. Specifically, we consider the Doppler function, as it is well known to be difficult to recover from noisy data.


Figure 3: Quantile regressions for the Doppler function with τ = 0.5. We notice that the quantile regression with RKHS norm penalization performs the best compared to the other two.

We generate 1000 data points from the following model:

y_i = √(x_i(1 − x_i)) · sin(2.1π / (x_i + 0.05)) + ε_i,   i = 1, ..., 1000,

where ε_i ∼ N(0, 0.01). The x_i are equally spaced between 0.01 and 0.99. We compare linear quantile regression, the quantile smoothing spline with the ℓ1 penalty and λ = 1, and quantile regression penalized by the RKHS norm with a Gaussian RBF kernel, with automatically chosen σ^2 = 12.926 and C = 0.1. The resulting fits are shown in Figure 3. The quantile regression with RKHS norm penalization outperforms the other two methods. But since it is a QP, whereas linear quantile regression and the quantile smoothing spline are LPs, it is more computationally expensive. Figure 4 gives the computational time taken by the different algorithms versus the number of samples being considered for the regression.
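The simulation above is easy to reproduce; a sketch (we read N(0, 0.01) as variance 0.01, i.e., standard deviation 0.1, and the seed is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = np.linspace(0.01, 0.99, n)  # equally spaced design points
doppler = np.sqrt(x * (1 - x)) * np.sin(2.1 * np.pi / (x + 0.05))
y = doppler + rng.normal(scale=0.1, size=n)  # N(0, 0.01) noise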


Figure 4: CPU time taken by the different algorithms versus the number of samples considered.


Figure 5: The galaxy on the left has two light centers, hence its Multimode statistic is positive. On the other hand, the galaxy on the right has a single center, so its M-statistic is zero.

Since we fit on a simulated data set, we can compare the tilting scores

∑_i ρ_τ(r(x_i) − ŷ_i)

for each fit, where r is the true Doppler function and ŷ_i are the fitted values. The tilting score for the naive linear quantile regression is 119.47, for the quantile smoothing spline 54.44, and for quantile regression with RKHS norm penalization 47.54. Thus, the RKHS norm penalized quantile regression outperforms the others; however, it is also the most computationally expensive.

3.2 Application to an astronomy data set

We now apply the methods to an astronomical data set describing the evolution of galaxy morphology. The data were collected from images of 3385 galaxies taken by the Hubble Great Observatories Origins Deep Survey (GOODS) South Field. For each galaxy we have photometric redshift measurements and morphological summary statistics.

Galaxy morphology refers to the shape and appearance of galaxies. We model the Multimode (M) statistic, which quantifies whether a galaxy has two light centers in its image. As shown in Figure 5, the galaxy on the left clearly shows two light centers and has an M-statistic above zero, indicating recent merging activity, while the galaxy on the right is clearly unimodal, has an M-statistic of zero, and shows no recent merging activity. We want to compare M-statistic values over time to understand the merging history of the Universe.

Redshift is a stretching effect on the light waves emitted by galaxies that arises from the expansion of the Universe. It is defined as the ratio of the change in wavelength to the emitted wavelength:

z = (λ_observed − λ_emitted) / λ_emitted.

Since the light we observe from more distant galaxies was emitted longer ago, and has thus undergone more stretching, redshift can be used as a measurement system for cosmic time. The problem of describing galaxy morphology evolution hence becomes studying the change in image feature statistics as a function of redshift.

As shown in Figure 6, the distribution of the M-statistic is highly skewed. Outliers pull the conditional expectation line significantly up from the majority of points. Therefore, we use quantile regression in order to better quantify the change in the distribution of the M-statistic given redshift. We show the comparison of our two nonparametric quantile regression methods (the quantile smoothing spline and the RKHS method) in Figure 6. The quantile smoothing spline is fit with λ = 1. The quantile regression with RKHS norm penalization is fit with a Gaussian RBF kernel, with automatically chosen σ^2 = 23.4 and C = 0.1. The RKHS method is able to recover more subtle trends in the conditional quantiles, while the piecewise linear fit produced by quantile smoothing spline regression does not capture the finer details.

The conditional median slowly increases across the entire redshift range, and we see the most dramatic increase in the conditional third quartile (τ = 0.75) in the range 3 ≤ z ≤ 3.5. Scientifically, this means that the younger Universe had more galaxy merging activity.



Figure 6: The Multimode (M) statistic vs. redshift. The distribution of the M-statistic is very skewed and thus suggests the use of quantile regression. The RKHS norm penalized quantile regression appears to capture finer features in the underlying distribution. Both methods suggest that the M-statistic increases with redshift.

4 Concluding Remarks

The conditional expectation is most often the quantity of interest in regression, but there are many problems that dictate the use of quantile regression. Linear regression can naturally be extended to quantile estimation, and nonparametric methodology is also well developed in this context. We show that linear quantile regression can be expressed as a linear program, and we also analyze two nonparametric methodologies, quantile smoothing splines and RKHS norm penalized quantile regression, each of which can also be expressed as a convex problem. Our analysis of the methods applied to a complex simulated data set suggests that quantile regression with the RKHS norm is the most successful in recovering the underlying structure. Nevertheless, this method is also the most computationally expensive.

Our application of the nonparametric methods to galaxy morphology data (which is known to have complex structure) collected by the Hubble GOODS South Field further supports the conclusion that quantile regression with the RKHS norm is more sensitive to subtle fluctuations in the underlying distribution. Both the quantile smoothing spline and the RKHS norm penalized quantile regression show an increasing trend in the M-statistic as a function of redshift. This implies that the Universe has had progressively less merging activity as it evolves in time.

References

[1] R. Koenker and G. Bassett, "Regression quantiles," Econometrica, vol. 46, no. 1, pp. 33–50, 1978.

[2] R. Koenker and K. Hallock, "Quantile regression: An introduction," Journal of Economic Perspectives, vol. 15, no. 4, pp. 143–156, 2001.

[3] R. Koenker, P. Ng, and S. Portnoy, "Quantile smoothing splines," Biometrika, vol. 81, no. 4, pp. 673–680, 1994.

[4] I. Takeuchi, Q. V. Le, T. D. Sears, and A. J. Smola, "Nonparametric quantile estimation," Journal of Machine Learning Research, vol. 7, pp. 1231–1264, 2006.

[5] R. J. Bosch, Y. Ye, and G. G. Woodworth, "A convergent algorithm for quantile regression with smoothing splines," Computational Statistics & Data Analysis, vol. 19, no. 6, pp. 613–630, 1995. [Online]. Available: http://www.sciencedirect.com/science/article/pii/016794739400018E

[6] W. Hendricks and R. Koenker, "Hierarchical spline models for conditional quantiles and the demand for electricity," Journal of the American Statistical Association, vol. 87, no. 417, pp. 58–68, 1992.
