Upload
trinhkhanh
View
225
Download
0
Embed Size (px)
Citation preview
First-order methods for structured nonsmoothoptimization
Sangwoon Yun
Department of Mathematics EducationSungkyunkwan University
Oct 19, 2016Center for Mathematical Analysis & Computation, Yonsei University
Outline
I. Coordinate (Gradient) Descent Method
II. Incremental Gradient Method
III. (Linearized) Alternating Direction Method of Multipliers
Thank you!
I. Coordinate (Gradient) Descent Method
Structured Nonsmooth Optimization
minx
F (X ) = f (x) + P(x),
f : real-valued, (convex) smooth on domf
P: proper, convex, lsc
In particular, P: separable i.e., P(x) =∑n
j=1 Pj (xj )
Bound constraint:
P(x) =
0 if l ≤ x ≤ u;
∞ else,
where l ≤ u (possibly with −∞ or∞).
`1-norm: P(x) = λ‖x‖1 with λ > 0
or indicator function of linear constraints (Ax = b).
Bound-constrained Optimization
minl≤x≤u
f (x),
where f : <N → < is smooth, l ≤ u (possibly with −∞ or∞ components).
Can be reformulated as the following unconstrained optimization problem:
minx
f (x) + P(x),
where P(x) =
0 if l ≤ x ≤ u∞ else
.
`1-regularized Convex Minimization
1. `1-regularized linear least squares problem
Find x so that Ax − b ≈ 0 and x has “few” nonzeros.Formulate this as an unconstrained convex optimization problem:
minx∈<n
‖Ax − b‖22 + λ‖x‖1
2. `1-regularized logistic regression problem
minw∈<n−1,v∈<
1m
m∑i=1
log(1 + exp(−(wT ai + vbi ))) + λ‖w‖1,
where ai = bizi and (zi ,bi ) ∈ <n−1 × −1,1, i = 1, ...,m are a given set of(observed or training) examples.
Support Vector Machines
Support Vector Machines
Support Vector Classification
Training points : zi ∈ <p, i = 1, ...,n.
Consider a simple case with two classes (linear separable case):Define a vector a :
ai =
1 if zi in class 1−1 if zi in class 2
A hyperplane (0 = wT z − b) separates data with the maximal margin.Margin is the distance of the hyperplane to the nearest of the positiveand negative points.Nearest points lie on the planes ±1 = wT z − b
Support Vector Machines
Convex Quadratic Programming Problem: Support Vector Classification
The (original) Optimization Problem
mina,b
12‖a‖2
2
subject to zi(aT xi − b
)≥ 1, i = 1, ...,n.
The Modified Optimization Problem (allows, but penalizes, the failure ofa point to reach the correct margin)
mina,b,ξ
12‖a‖2
2 + Cn∑
i=1
ξi
subject to zi(aT xi − b
)≥ 1− ξi , ξi ≥ 0, i = 1, ...,n.
Support Vector Machines
SVM (Dual) Optimization Problem (Convex Quadratic Program)
minx12 xT Qx − eT x
subject to 0 ≤ xi ≤ C, i = 1, ...,n,aT x = 0,
where a ∈ −1,1n, 0 < C ≤ ∞, e = [1, ...,1]T , Q ∈ <n×n is a sym. pos.semidef. with Qij = aiajK (zi , zj ), K : <p ×<p → < (“kernel function”), andzi ∈ <p (“i th data point”), i = 1, ...,n.
Popular Choices of K :
linear kernel K (zi , zj ) = zTi zj
radial basis function kernel K (zi , zj ) = exp(−γ‖zi − zj‖22)
sigmoid kernel K (zi , zj ) = tanh(γzTi zj )
where γ is a constant.Q is an n × n fully dense matrix and even indefinite. (n ≥ 5000)
Rank Minimization
Now, imagine that we only observe a few entries of a data matrix. Then is itpossible to accurately guess the entries that we have not seen?
Netflix problem: Given a sparse matrix where Mij is the rating given by user ion movie j , predict the rating a user would assign to a movie he has not seen,i.e., we would like to infer users preference for unrated movies. (impossible!in general)
The problem is ill-posed. Intuitively, users’ preferences depend only on a fewfactors, i.e., rank(M) is small.
Thus can be formulated as the low-rank matrix completion problem (affinerank minimization):
minX∈<m×n
rank(X ) |Xij = Mij , (i , j) ∈ Ω
, (NP hard!)
where Ω is an index set of p observed entries.
Rank Minimization
Nuclear norm minimization:
minX∈<m×n
‖X‖∗ :=
m∑i=1
σi (X ) |Xij = Mij , (i , j) ∈ Ω.
where σi (X )’s are singular values of X .
a more general nuclear norm minimization problem:
minX∈<m×n
‖X‖∗ : A(X ) = b
.
When the matrix variable is restricted to be diagonal, the above problemreduces to the following `1-minimization problem:
minx∈<n
‖x‖1 : Ax = b
.
Nuclear Norm Regularized Least Squares Problem
If the observation b is contaminated with noise
consider the following nuclear norm regularized least squares problem:
minX∈<m×n
12‖A(X )− b‖2
2 + µ‖X‖∗.
where µ > 0 is a given parameter.
appeared in many applications of engineering and science including
collaborative filtering
global positioning
system identification
remote sensing
computer vision
Sparse Portfolio Selection
How to allocate an investor’s available capital into a prefixed set of assetswith the aims of maximizing the expected return and minimizing theinvestment risk.
Traditional Markowitz portfolio selection model:
minx
xT Qx
subject to µT x = β, eT x = 1, x ≥ 0,
where β is the desired expected return of the portfolio.
Modified models:
minx
xT Qx
subject to µT x = β, eT x = 1, x ≥ 0, ‖x‖0 ≤ K .
minx
‖x‖0
subject to µT x = β, xT Qx ≤ α, eT x = 1, x ≥ 0.
Sparse Covariance Selection
Undirected graphical models offer a way to describe and explain relationshipsamong a set of variables, central element of multivariate data analysis
From a sample covariance matrix, wish to estimate the true covariancematrix, in which some of entries of its inverse are zero, by maximizing`1-regularized log-likelihood:
maxX∈Sn
log detX −⟨
Σ, X⟩−∑
(i,j) 6∈V ρij |Xij |subject to Xij = 0, ∀(i , j) ∈ V ,
where
Σ ∈ Sn+ is an empirical covariance matrix and Σ is singular or nearly so
May want to impose structural conditions on Σ−1, such as conditionalindependence, which is reflected as zero entries in Σ−1.
V is a collection of all pairs of conditional independent nodes
ρij > 0: parameter controlling the trade-off between the goodness-of-fitand the sparsity of X
Sparse Covariance Selection
− log det X +⟨
Σ, X⟩
is strictly convex, cont. diff. on its domain Sn++, O(n3)
opers. to evaluate. In applications, n can exceed 5000.
Dual problem can be expressed:
minX∈Sn
− log det X − n
subject to |(X − Σ)ij | ≤ υij , i , j = 1, ...,n,
where υij = ρij for all (i , j) 6∈ V and υij =∞ for all (i , j) ∈ V .
Coordinate Descent Method
When P ≡ 0. Given x ∈ <n, Choose i ∈ N = 1, ...,n. Update
xnew = arg minu|uj=xj ∀j 6=i
f (u).
Repeat until convergence.
Gauss-Seidel rule: Choose i cyclically, 1, 2, ..., n, 1, 2, ...
Gauss-Southwell rule: Choose i with | ∂f∂xi
(x)| maximum.
Coordinate Descent Method
Properties:
If f convex, then every cluster point of the x-sequence is a minimizer.
If f nonconvex, then G-Seidel can cycle 1 but G-Southwell stillconverges.
Convergence is possible when P 6≡ 0 2
1M. J. D. Powell, On search directions for minimization algorithms, Math. Program. 4, (1973),193–201
2P. Tseng, Convergence of block coordinate descent method for nondifferentiableminimization, J. Optim. Theory Appl. 109, (2001), 473–492
Coord. Gradient Descent Method
Descent direction.For x ∈ domP, choose J ( 6= ∅) ⊆ N and H 0n, Then solve
mind|dj=0 ∀j 6∈J
∇f (x)T d +12
dT Hd + P(x + d)− P(x) direc.subprob
Let dH(x ;J ) and qH(x ;J ) be the opt. soln and obj. value of the direc.subprob.
Facts:
dH(x ;N ) = 0 ⇔ F ′(x ; d) ≥ 0 ∀d ∈ <n.
H is diagonal ⇒ dH(x ;J ) =∑j∈J
dH(x ; j), qH(x ;J ) =∑j∈J
qH(x ; j).
qH(x ;J ) ≤ − 12 dT Hd where d = dH(x ;J ).
Coord. Gradient Descent Method
This coord. grad. descent approach may be viewed as a hybrid ofgradient-projection and coordinate descent. In particular,
if J = N and P(x) =
0 if l ≤ x ≤ u∞ else
, then dH(x ;N ) is a scaled
gradient-projection direction for bound-constrained optimization.
if f is quadratic and we choose H = ∇2f (x), then dH(x ;J ) is a (block)coordinate descent direction.
If H is diagonal, then subproblems can be solved in parallel.
If P ≡ 0, then dH(x)j = −∇f (x)j/Hjj .
If P(x) =
0 if l ≤ x ≤ u∞ else
, then
dH(x)j = medianlj − xj ,−∇f (x)j/Hjj ,uj − xj.
If P is the 1-norm, thendH(x)j = −median(∇f (x)j − λ)/Hjj , xj , (∇f (x)j + λ)/Hjj.
Coord. Gradient Descent Method
Stepsize: Armijo rule
Choose α to be the largest element of βkk=0,1,... satisfying
F (x + αd)− F (x) ≤ σαqH(x ;J ) (0 < β < 1, 0 < σ < 1).
For the `1-regularized linear least squares problem, the minimization rule
α ∈ arg minF (x + td) | t ≥ 0
or the limited minimization rule
α ∈ arg minF (x + td) | 0 ≤ t ≤ s,
where 0 < s <∞, can also be used.
Coord. Gradient Descent Method
Choose J :
Gauss-Seidel rule:J cycles through 1, 2, ..., n.
Gauss-Southwell-r rule:
‖dD(x ;J )‖∞ ≥ υ‖dD(x ;N )‖∞
where 0 < υ ≤ 1, D 0n is diagonal (e.g., D = diag(H)).
Gauss-Southwell-q rule:
qD(x ;J ) ≤ υ qD(x ;N ),
Where 0 < υ ≤ 1, D 0n is diagonal (e.g., D = diag(H)).
Coord. Gradient Descent Method
Advantage of CGD
CGD method is simple, highly parallelizable, and is suited for solvinglarge-scale problems.
CGD not only has cheaper iterations than exact coordinate descent, italso has stronger global convergence properties.
Convergence Results
Global convergence If
0 ≺ λI D, H λI,
J is chosen by G-Seidel, G-Southwell-r , G-Southwell-q rule,
is chosen by Armijo rule,
then every cluster point of the x-sequence generated by CGD method is astationary point of F .
Convergence Results
Local convergence rate If
0 ≺ λI D, H λI,
J is chosen by G-Seidel or Gauss-Southwell-q rule,
α is chosen by Armijo rule,
in addition, if P and f satisfy any of the following assumptions, then thex-sequence generated by CGD method converges at R-linear rate.
C1 f is strongly convex, ∇f is Lipschitz cont. on domP.
C2 f is (nonconvex) quadratic. P is polyhedral.
C3 f (x) = g(Ex) + qT x , where E ∈ <m×N , q ∈ <N , g is stronglyconvex, ∇g is Lipschitz cont. on <m. P is polyhedral.
C4 f (x) = maxy∈Y(Ex)T y − g(y)+ qT x , where Y ⊆ <m ispolyhedral, E ∈ <m×N , q ∈ <N , g is strongly convex, ∇g isLipschitz cont. on <m. P is polyhedral.
Convergence Results
Complexity Bound
If f is convex with Lipschitz cont. grad., then the number of iterations forachieving ε-optimality is
Gauss-Seidel rule:
O(
n2Lr0
ε
),
where L is a Lipschitz constant andr0 = max
dist(x ,X ∗)2 | F (x) ≤ F (x0)
.
Gauss-Southwell-q rule:
O(
Lr0
υε+ max
0,
Lυ
ln(
e0
r0
)),
where e0 = F (x0)−minx∈X F (x).
Thank you!
II. Incremental Gradient Method
Sum of several functions
minx
F (x) := f (x) + P(x),
where c > 0, P : <n → (−∞,∞] is a proper, convex, lower semicontinuous(lsc) function, and
f (x) :=m∑
i=1
fi (x),
where each function fi is real-valued and smooth (i.e., continuouslydifferentiable) on <n.
Sum of several functions
In applications, m is often large (exceeding 104).
In this case, traditional gradient methods would be inefficient since theyrequire evaluating ∇fi (x) for all i before updating x .
Incremental gradient methods (IGM), in contrast, update x afterevaluation of ∇fi (x) for only one or a few i .
In the unconstrained case P ≡ 0, this method has the basic form
xk+1 = xk + αk∇fik (xk ), k = 0,1, . . . ,
where ik is chosen to cycle through 1, . . . ,m (i.e.,i0 = 1, i1 = 2, . . . , im−1 = m, im = 1, . . . ) and αk > 0.
Sum of several functions
For global convergence of IGMs, stepsize αk (also called “learning rate”)needs to diminish to zero, which can lead to slow convergence3.
If a constant stepsize is used, only convergence to an approximatesolution can be shown4.
Methods5 to overcome this difficulty were proposed. However, thesemethods need additional assumptions such as ∇fi (x) = 0 for all i at astationary point x to achieve global convergence without the stepsizetending to zero.
Moreover, its extension to P 6≡ 0 is problematic.
3D. P. Bertsekas, A new class of incremental gradient methods for least squares problems,SIAM J. Optim., 7 (1997), 913–926
4M. V. Solodov, Incremental gradient algorithms with stepsizes bounded away from zero,Comput. Optim. Appl., 11 (1998), 23–35
5P. Tseng, An incremental gradient(-projection) method with momentum term and adaptivestepsize rule, SIAM J. Optim., 8 (1998), 506–531
Sum of several functions
For the case of P ≡ 0, Blatt, Hero and Gauchman6 proposed a method:
gk = gk−1 +∇fik (xk )−∇fik (xk−m),
xk+1 = xk − αgk ,
with α > 0, g−1 =∑m
i=1∇fi (x i−m−1), x0, x−1, . . . , x−m ∈ <n given.
Computes the gradient of a single component function at each iteration.
But instead of updating x using this gradient, it uses the sum of the mmost recently computed gradients.
This method requires more storage (O(mn) instead of O(n)) and slightlymore communication/computation per iteration.
6D. Blatt, A. O. Hero, and H. Gauchman, A convergent incremental gradient method with aconstant step size, SIAM J. Optim., 18 (2007), pp. 29–51
IGM: Constant Stepsize
0. Choose x0, x−1, · · · ∈ domP and α ∈ (0,1]. Initialize k = 0. Go to 1.
1. Choose Hk 0 and 0 ≤ τ ki ≤ k for i = 1, . . . ,m, and compute gk , dk ,
and xk+1 by
gk =m∑
i=1
∇fi (xτki ),
dk = arg mind∈<n
〈gk ,d〉+
12〈d ,Hk d〉+ P(xk + d)
,
xk+1 = xk + αdk .
Increment k by 1 and return to 1.
The method of Blatt et al. corresponds to the special case of P ≡ 0, Hk = I,K = m − 1, and
τ ki =
k if i = (k mod m) + 1;
τ k−1i otherwise,
1 ≤ i ≤ m, k ≥ m,
IGM: Constant Stepsize
Assumption 1
(a) τ ki ≥ k − K for all i and k , where K ≥ 0 is an integer.
(b) λI Hk λI for all k , where 0 < λ ≤ λ.
Assumption 2
‖∇fi (y)−∇fi (z)‖ ≤ Li‖y − z‖ ∀y , z ∈ domP,
for some Li ≥ 0, i = 1, . . . ,m. Let L =∑m
i=1 Li .
IGM: Constant Stepsize
Global Convergence
Let xk, dk, Hk be sequences generated by Algorithm 1 underAssumptions 1 and 2, and with α < 2λ/(L(2K + 1)).Then dk → 0 and every cluster point of xk is a stationary point.
IGM: Adaptive Stepsize
0. Choose x0, x−1, · · · ∈ domP, β ∈ (0,1), σ > 12 , and α ∈ (0,1]. Initialize
k = 0. Go to 1.
1. Choose Hk 0 and 0 ≤ τ ki ≤ k for i = 1, . . . ,m, compute gk , dk by
gk =m∑
i=1
∇fi (xτki ),
dk = arg mind∈<n
〈gk ,d〉+
12〈d ,Hk d〉+ cP(xk + d)
.
Choose αinit
k ∈ [α,1] and let αk be the largest element of αinit
k βjj=0,1,...
satisfying the descent-like condition
Fc(xk + αk dk )− Fc(xk ) ≤ −σKL‖αk dk‖2 +L2
k−1∑j=(k−K )+
‖αjd j‖2
and set xk+1 = xk + αdk . Increment k by 1 and return to 1.
IGM: Adaptive Stepsize
Global Convergence
Let xk, dk, Hk, αk be sequences generated by Algorithm 2 underAssumptions 1 and 2. Then the following results hold.
(a) For each k ≥ 0, the descent-like condition holds whenever αk ≤ α,where α = λ
L(σK+K/2+1/2) .
(b) We have αk ≥ minα, βα for all k .
(c) dk → 0 and every cluster point of xk is a stationary point.
IGM: 1-memory with Adaptive Stepsize
0. Choose x0 ∈ domP. Initialize k = 0 and g−1 = 0. Go to 1.
1. Choose Hk 0 and compute gk , dk , and xk+1 by
gk =k
k + 1gk−1 +
mk + 1
∇fik (xk ) with ik = (k mod m) + 1,
dk = arg mind∈<n
〈gk ,d〉+
12〈d ,Hk d〉+ cP(xk + d)
,
xk+1 = xk + αk dk ,
with αk ∈ (0,1].Increment k by 1 and return to 1.
IGM: 1-memory with Adaptive Stepsize
Assumption 3
(a)∞∑
k=0
αk =∞.
(b) lim`→∞
∑j=0
j + 1`+ 1
δj = 0, where δj := maxi=0,1,...,m
‖xk+i − xk+m‖∣∣∣∣k=jm−1
(x−1 = x0).
IGM: 1-memory with Adaptive Stepsize
Global Convergence
Let xk, dk, Hk, αk be sequences generated by Algorithm 3 underAssumptions 1(b), 2, and 3. Then the following results hold.
(a) ‖xk+1 − xk‖ → 0 and ‖∇f (xk )− gk‖ → 0.
(b) lim infk→∞ ‖dk‖ = 0.
(c) If xk is bounded, then there exists a cluster point of xk that is astationary point.
Thank you!
III. (Linearized) Alternating Direction Method ofMultipliers
Total Variation Regularized Linear Least SquaresProblemImage restorations such as image deconvolution, image inpainting and imagedenoising is often formulated as an inverse problem
b = Au + η,
unknown true image u ∈ <n.
observed image (or measurements) b ∈ <`.
η is a Gaussian noise.
A ∈ <`×n is a linear operator, typically a convolution operator indeconvolution, a projection in inpainting and the identity in denoising.
The unknown images can be recovered by solving TV regularized linear leastsquares problem (Primal):
minx∈<n
12‖Ax − b‖2
2 + µ‖∇x‖,
where µ > 0 and ‖∇x‖ =∑n
i=1 ‖(∇x)i‖2 with (∇x)i ∈ <2.
Gaussian Noise
Figure: denoising
(a) original: 512 × 512 (b) noised image (c) recovered image
Gaussian Noise
Figure: deblurring
(a) original: 512 × 512 (b) motion blurred image (c) recovered image
Gaussian Noise
Figure: inpainting
(a) original: 512 × 512 (b) scratched image (c) recovered image
AlgorithmsSaddle Point Formulation and Dual Problem
1. Dual Norm:min
umax‖p‖∗≤1
〈∇u, p〉+12‖Au − b‖2
2.
2. Convex Conjugate (Legendre-Fenchel transform) of J(u) = ‖∇u‖:
minu
supp〈p, ∇u〉 − J∗(p) +
12‖Au − b‖2
2.
where J∗(p) = supz 〈p, z〉 − J(z).
3. Lagrangian Function:
maxp
infu,zL1(u,p, z).
whereL1(u,p, z) =:
12‖Au − b‖2
2 + ‖z‖+ 〈p, ∇u − z〉
Algorithms
4. Dual Problem:
maxp
infu〈p, ∇u〉 − J∗(p) +
12‖Au − b‖2
2 = −J∗(p)− H∗(div p)
.
where H(u) = 12‖Au − b‖2
2.
5. Lagrangian Function with y = div p:
maxu
infp,yL2(u,p, y).
whereL2(u,p, y) =: J∗(p) + H∗(div p) + 〈u, div p − y〉
ADMM
Alternating Direction Method of Multipliers7
Augmented Lagrangian Function:
Lα(u,p, z) =:12‖Au − b‖2
2 + ‖z‖+ 〈p, ∇u − z〉+α
2‖∇u − z‖2
2
uk+1 = arg min
u
12‖Au − b‖2
2 +⟨pk , ∇u − zk
⟩+ α
2 ‖∇u − zk‖22
zk+1 = arg mind‖z‖+
⟨pk , ∇uk+1 − z
⟩+ α
2 ‖∇uk+1 − z‖22
pk+1 = pk + α(zk+1 −∇uk+1)
ADMM on primal (or 3) is equivalent to Douglas-Rachford8 on dual (4).ADMM on dual (or 5) is equivalent to Douglas-Rachford on primal.
7E. Esser, X. Zhang, and T. Chan, A general framework for a class of first order primal-dualalgorithms for convex optimization in imaging science, SIAM J. Imaging Sci., 2010
8J. Douglas and H. H. Rachford, On the numerical solution of heat conduction problems intwo or three space variables, Trans. Amer. Math. Soc., 1956
ADMMAlternating Minimization Algorithm & Forward-Backward Splitting Method
AMA9 on primal (or 3)uk+1 = arg min
u
12‖Au − b‖2
2 +⟨pk , ∇u − zk
⟩zk+1 = arg min
d‖z‖+
⟨pk , ∇uk+1 − z
⟩+ α
2 ‖∇uk+1 − z‖22
pk+1 = pk + α(zk+1 −∇uk+1)
FBS10 on dual (4) uk+1 = arg minu
12‖Au − b‖2
2 +⟨pk , ∇u
⟩pk+1 = arg min
dJ∗(p)−
⟨p, ∇uk+1
⟩+ 1
2α‖p − pk‖22
FBS on dual is equivalent to AMA on primal (or 3).
9P. Tseng, Applications of a splitting algorithm to decomposition in convex programming andvariational inequalities, SIAM J. Control and Optimization, 1991
10P. L. Combettes and V. R. Wajs, Signal recovery by proximal forward-backward splitting,Multiscale Model. Simul., 2005
Proximal Splitting Method
Linearly Constrained Separable Convex Programming Problem:
minx,yf (x) + g(y) | Qx = y .
Augmented Lagrangian Function:
Lα(x , y , z) := f (x) + g(y) + 〈z, y −Qx〉+α
2‖Qx − y‖2
2.xk+1 = arg min
uLα(x , yk , zk )
yk+1 = arg minzLα(xk+1, y , zk )
zk+1 = zk + α(Qxk+1 − yk+1)
Proximal Splitting Method
If f (x) = 12‖Ax − b‖2
2,
xk+1 = (AT A + αQT Q)−1(AT b + QT zk + αQT yk ).
If f (x) = 〈1, Ax〉 − 〈b, log(Ax)〉Need inner solver or inversion involving Q and/or A.
If f (x) = 〈x + be−x , 1〉Need inner solver.
Poisson Noise
Figure: deblurring
(a) original (b) blurred image (c) recovered image
Multiplicative Noise
Multiplicative Noise
Copyright : Sandia Nat. Lab. (http://www.sandia.gov/RADAR/sar.html)
Multiplicative Noise
4M (222) pixel @ 20sec.
Multiplicative Noise
Multiplicative Noise
Multiplicative Noise
Proximal Splitting Method
In order to avoid any inner iterations or inversions involving Laplacianoperator required in algorithms based on augment Lagrangian.
Alternating minimization algorithmuk+1 = L0(x , yk , zk )zk+1 = Lα(xk+1, y , zk )pk+1 = pk + α(zk+1 −∇uk+1)
Linearized augmented Lagrangian with proximal functionuk+1 = LLα(x , yk , zk )zk+1 = Lα(xk+1, y , zk )pk+1 = pk + α(zk+1 −∇uk+1)
LLα(x , xk , yk , zz) :=⟨∇x f (xk ), x − xk⟩+ g(yk ) +
⟨zk , yk −Qx
⟩+⟨αQT (Qxk − yk ), x − xk⟩+
12δ‖x − xk‖2
2
Thank you!Thank you!