Convex Optimization

Dani Yogatama

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

February 12, 2014

Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12, 2014 1 / 26

Key Concepts in Convex Analysis: Convex Sets

Key Concepts in Convex Analysis: Convex Functions

Key Concepts in Convex Analysis: Minimizers

Key Concepts in Convex Analysis: Strong Convexity

Recall the definition of a convex function: ∀λ ∈ [0, 1],

$$f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda) f(x')$$

A β-strongly convex function satisfies a stronger condition: ∀λ ∈ [0, 1],

$$f(\lambda x + (1-\lambda)x') \le \lambda f(x) + (1-\lambda) f(x') - \frac{\beta}{2}\lambda(1-\lambda)\|x - x'\|_2^2$$

[Figure: two panels, "convexity" and "strong convexity"]

Strong convexity ⇒ strict convexity, but strict convexity ⇏ strong convexity.
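
The strong-convexity inequality can be spot-checked numerically. The sketch below is not from the slides: the function names are hypothetical, the check is one-dimensional, and random sampling can only suggest (not prove) that the inequality holds. It uses f(x) = x², which satisfies the inequality with β = 2, and f(x) = |x|, which is convex but not strongly convex.

```python
import random

def strong_convexity_gap(f, x, xp, lam, beta):
    """Right-hand side minus left-hand side of the beta-strong-convexity inequality (1-D)."""
    lhs = f(lam * x + (1 - lam) * xp)
    rhs = lam * f(x) + (1 - lam) * f(xp) - 0.5 * beta * lam * (1 - lam) * (x - xp) ** 2
    return rhs - lhs

def holds_on_samples(f, beta, trials=1000, seed=0):
    """Check the inequality on random (x, x', lambda) triples; a spot check, not a proof."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, xp, lam = rng.uniform(-5, 5), rng.uniform(-5, 5), rng.random()
        if strong_convexity_gap(f, x, xp, lam, beta) < -1e-9:
            return False
    return True

square = lambda x: x * x
print(holds_on_samples(square, beta=2.0))   # x^2 with beta = 2
print(holds_on_samples(square, beta=3.0))   # beta = 3 is too large for x^2
print(holds_on_samples(abs, beta=0.0))      # |x| with beta = 0 (plain convexity)
print(holds_on_samples(abs, beta=0.1))      # |x| with any beta > 0
```

Setting β = 0 recovers the ordinary convexity inequality, so the same helper covers both definitions.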

Key Concepts in Convex Analysis: Subgradients

Convexity ⇒ continuity (on the interior of the domain); convexity ⇏ differentiability (e.g., f(w) = ‖w‖₁).

Subgradients generalize gradients for (possibly non-differentiable) convex functions:

v is a subgradient of f at x if f(x′) ≥ f(x) + v⊤(x′ − x) for all x′

Subdifferential: ∂f(x) = {v : v is a subgradient of f at x}

If f is differentiable, ∂f(x) = {∇f(x)}

[Figure: two panels, "linear lower bound" and "non-differentiable case"]

Notation: $\tilde{\nabla} f(x)$ is a subgradient of f at x
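
As an illustration of the definition, the sketch below builds one subgradient of f(w) = ‖w‖₁ and checks the linear lower bound at a few test points. The helper names and the test points are assumptions made for this example; at coordinates where w_i = 0, any value in [−1, 1] is a valid choice, and the code picks 0.

```python
def l1_norm(w):
    return sum(abs(wi) for wi in w)

def l1_subgradient(w):
    """One valid subgradient of the l1 norm: sign(w_i), with 0 chosen where w_i = 0."""
    return [(wi > 0) - (wi < 0) for wi in w]

def is_lower_bound(f, x, v, points, tol=1e-12):
    """Check f(x') >= f(x) + v^T (x' - x) at the given test points."""
    fx = f(x)
    return all(
        f(xp) >= fx + sum(vi * (xpi - xi) for vi, xpi, xi in zip(v, xp, x)) - tol
        for xp in points
    )

x = [1.5, 0.0, -2.0]
v = l1_subgradient(x)
points = [[0.0, 0.0, 0.0], [3.0, -1.0, 2.0], [-1.0, 4.0, -2.5]]
print(v, is_lower_bound(l1_norm, x, v, points))
```

Passing the check at a handful of points does not prove that v is a subgradient, but a single failing point is enough to rule a candidate out.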

Establishing convexity

How to check if f (x) is a convex function?

Verify definition of a convex function.

Check if the second derivative ∂²f(x)/∂x² is greater than or equal to 0 (for a twice-differentiable function); in more than one dimension, check that the Hessian is positive semidefinite.

Show that it is constructed from simple convex functions with operations that preserve convexity:

  nonnegative weighted sum
  composition with affine function
  pointwise maximum and supremum
  composition
  minimization
  perspective

Reference: Boyd and Vandenberghe (2004)
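
The second-derivative test can be approximated numerically in one dimension. This is a rough sketch under stated assumptions: the function names, grid, step size h, and tolerance are all hypothetical choices, and a finite-difference check on a grid can only give evidence of convexity on that interval.

```python
import math

def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def looks_convex(f, lo, hi, n=201, tol=1e-6):
    """True if the estimated f'' is >= -tol on an n-point grid over [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return all(second_derivative(f, lo + k * step) >= -tol for k in range(n))

softplus = lambda x: math.log1p(math.exp(x))   # logistic loss log(1 + e^x)

print(looks_convex(softplus, -5.0, 5.0))   # second derivative is sigma(x) * (1 - sigma(x)) >= 0
print(looks_convex(math.sin, 0.0, 3.0))    # sin'' = -sin is negative on (0, pi)
```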

Unconstrained Optimization

Algorithms:

First order methods (gradient descent, FISTA, etc.)

Higher order methods (Newton’s method, ellipsoid, etc.)

Gradient descent

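Problem: $\min_x f(x)$

Algorithm:

$g_t = \partial f(x_{t-1}) / \partial x$, the gradient at the previous iterate.

$x_t = x_{t-1} - \eta g_t$, where η > 0 is the step size.

Repeat until convergence.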
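
A minimal sketch of this loop is given below. The stopping rule (gradient norm below a tolerance), the fixed step size, and the example objective are assumptions chosen for illustration; they are not specified in the slides.

```python
def gradient_descent(grad, x0, eta=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_t = x_{t-1} - eta * g_t until the gradient norm falls below tol."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

# Example objective (an assumption): f(x) = (x0 - 3)^2 + 2 * (x1 + 1)^2
grad_f = lambda x: [2.0 * (x[0] - 3.0), 4.0 * (x[1] + 1.0)]
x_min = gradient_descent(grad_f, [0.0, 0.0])
print(x_min)   # close to the minimizer (3, -1)
```

With a fixed step size, convergence depends on η being small enough relative to the curvature of f; here η = 0.1 contracts both coordinates at every step.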

Newton’s method

Problem: $\min_x f(x)$

Assume f is twice differentiable.

Algorithm:

$g_t = \partial f(x_{t-1}) / \partial x$.

$s_t = H^{-1} g_t$, where H is the Hessian of f at $x_{t-1}$.

$x_t = x_{t-1} - \eta s_t$.

Newton’s method is a special case of steepest descent using Hessian norm.
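
In one dimension the Newton step reduces to dividing the gradient by the second derivative. The sketch below assumes that 1-D setting; the function name, the stopping rule, and the example objective f(x) = eˣ − 2x (minimized at x = ln 2) are illustrative choices, not part of the slides.

```python
import math

def newton_1d(grad, hess, x0, eta=1.0, tol=1e-12, max_iter=100):
    """1-D Newton iteration: s_t = g_t / H_t, then x_t = x_{t-1} - eta * s_t."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            break
        x = x - eta * g / hess(x)
    return x

# Example objective (an assumption): f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x)
x_star = newton_1d(lambda x: math.exp(x) - 2.0, lambda x: math.exp(x), x0=0.0)
print(x_star, math.log(2.0))
```

In higher dimensions the division becomes a linear solve with the Hessian, which is the main per-iteration cost relative to gradient descent.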

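Duality

Primal problem:

$$\min_x f(x) \quad \text{subject to} \quad g_i(x) \le 0,\ i = 1, \dots, m, \qquad h_i(x) = 0,\ i = 1, \dots, p$$

for x ∈ X.

Lagrangian:

$$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)$$

$\lambda_i$ and $\nu_i$ are Lagrange multipliers.

Suppose x is feasible and λ ≥ 0. Then each $\lambda_i g_i(x) \le 0$ and each $h_i(x) = 0$, so we get the lower bound:

$$f(x) \ge L(x, \lambda, \nu) \quad \forall x \in X,\ \lambda \in \mathbb{R}^m_+$$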

Duality

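Primal optimal: $p^* = \min_x \max_{\lambda \ge 0, \nu} L(x, \lambda, \nu)$

Lagrange dual function: $\min_x L(x, \lambda, \nu)$

This is a concave function of (λ, ν), regardless of whether f(x) is convex or not. It can be −∞ for some λ and ν.

Lagrange dual problem: $\max_{\lambda, \nu} \min_x L(x, \lambda, \nu)$ subject to λ ≥ 0

Dual feasible: (λ, ν) with λ ≥ 0 and in the domain of the dual function, i.e., $\min_x L(x, \lambda, \nu) > -\infty$.

Dual optimal: $d^* = \max_{\lambda \ge 0, \nu} \min_x L(x, \lambda, \nu)$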

Duality

Weak duality p∗ ≥ d∗ always holds, for convex and nonconvex problems.

Strong duality p∗ = d∗ does not hold in general, but usually holds for convex problems. For convex problems, strong duality is guaranteed by Slater’s constraint qualification.

Strong duality holds if the problem is convex and strictly feasible, i.e.

$$\exists x \in \operatorname{int} D \ \text{s.t.}\ g_i(x) < 0,\ i = 1, \dots, m, \quad h_i(x) = 0,\ i = 1, \dots, p$$

Assume strong duality holds and p∗ and d∗ are attained.

$$p^* = f(x^*) = d^* = \min_x L(x, \lambda^*, \nu^*) \le L(x^*, \lambda^*, \nu^*) \le f(x^*) = p^*$$

We have:

$x^* \in \arg\min_x L(x, \lambda^*, \nu^*)$.

$\lambda_i^* g_i(x^*) = 0$ for i = 1, ..., m (complementary slackness).
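
A tiny worked instance may help make these statements concrete. The problem below (minimize x² subject to 1 − x ≤ 0) is an assumption chosen for this sketch, not an example from the slides. Its Lagrangian is L(x, λ) = x² + λ(1 − x), minimized over x at x = λ/2, so the dual function is λ − λ²/4. The problem is convex and strictly feasible (e.g., x = 2), so strong duality is expected.

```python
def lagrangian(x, lam):
    """L(x, lam) for: minimize x^2 subject to g(x) = 1 - x <= 0."""
    return x * x + lam * (1.0 - x)

def dual_function(lam):
    """min_x L(x, lam); the minimizer is x = lam / 2, giving lam - lam^2 / 4."""
    return lagrangian(lam / 2.0, lam)

p_star = 1.0                                  # primal optimum, attained at x* = 1
lams = [k / 100.0 for k in range(0, 501)]     # grid of dual-feasible lam >= 0
values = [dual_function(l) for l in lams]
d_star = max(values)
lam_star = lams[values.index(d_star)]

print(all(v <= p_star + 1e-12 for v in values))   # weak duality on the grid
print(d_star, lam_star)                           # dual optimum and its maximizer
print(lam_star * (1.0 - 1.0))                     # complementary slackness term at x* = 1
```

Every grid value of the dual function lower-bounds p∗, and the bound is tight at λ∗ = 2, where the constraint is active and complementary slackness holds trivially.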

Karush-Kuhn-Tucker condition

For differentiable f, $g_i$, and $h_i$, the KKT conditions are:

$g_i(x^*) \le 0$, $h_i(x^*) = 0$ (primal feasibility)

$\lambda_i^* \ge 0$ (dual feasibility)

$\lambda_i^* g_i(x^*) = 0$ (complementary slackness)

$\partial L(x, \lambda^*, \nu^*) / \partial x \,|_{x = x^*} = 0$ (Lagrangian stationarity)

If $\hat{x}, \hat{\lambda}, \hat{\nu}$ satisfy the KKT conditions for a convex problem, they are optimal.
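
The four conditions can be checked mechanically for a candidate pair (x, λ). The sketch below hard-codes one small convex problem chosen for illustration (minimize (x₀ − 2)² + (x₁ − 1)² subject to x₀ + x₁ − 2 ≤ 0); the function names and tolerances are assumptions, and there are no equality constraints in this example.

```python
def kkt_residuals(x, lam):
    """KKT quantities for: minimize (x0-2)^2 + (x1-1)^2 subject to x0 + x1 - 2 <= 0."""
    g = x[0] + x[1] - 2.0
    # Gradient of the Lagrangian f(x) + lam * g(x) with respect to x.
    grad_L = [2.0 * (x[0] - 2.0) + lam, 2.0 * (x[1] - 1.0) + lam]
    return {
        "primal_feasible": g <= 1e-12,
        "dual_feasible": lam >= 0.0,
        "comp_slack": abs(lam * g),
        "stationarity": max(abs(c) for c in grad_L),
    }

def satisfies_kkt(x, lam, tol=1e-9):
    r = kkt_residuals(x, lam)
    return (r["primal_feasible"] and r["dual_feasible"]
            and r["comp_slack"] <= tol and r["stationarity"] <= tol)

print(satisfies_kkt([1.5, 0.5], 1.0))   # projection of (2, 1) onto the half-plane, multiplier 1
print(satisfies_kkt([2.0, 1.0], 0.0))   # unconstrained minimizer, infeasible for this problem
```

Because the example problem is convex, a pair that passes all four checks is optimal, which is the statement on the slide.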

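Support Vector Machines

Primal problem (hard constraint):

$$\min_w \frac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i \langle x_i, w \rangle \ge 1,\ i = 1, \dots, n$$

Lagrangian:

$$L(w, \lambda) = \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^{n} \lambda_i \left( y_i \langle x_i, w \rangle - 1 \right)$$

Minimizing with respect to w, we have:

$$\frac{\partial L(w, \lambda)}{\partial w} = 0 \;\Rightarrow\; w - \sum_{i=1}^{n} \lambda_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \lambda_i y_i x_i$$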

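Support Vector Machines

Plug this back into the Lagrangian:

$$L(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j x_i^\top x_j$$

Lagrange dual problem is:

$$\max_\lambda \ \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j x_i^\top x_j$$

$$\text{subject to} \quad \lambda_i \ge 0,\ i = 1, \dots, n, \qquad \sum_{i=1}^{n} \lambda_i y_i = 0$$

(The equality constraint arises when the primal includes a bias term b, i.e., constraints $y_i(\langle x_i, w \rangle + b) \ge 1$.)

Since this problem only depends on $x_i^\top x_j$, we can use kernels and learn in a high-dimensional space without having to explicitly represent φ(x).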
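
The dual objective touches the data only through inner products, which the sketch below makes explicit by passing the kernel as a function. The two-point dataset, the multiplier values, and the helper names are assumptions for illustration; the multipliers were chosen by hand to satisfy both dual constraints, and no optimizer is run.

```python
def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def dual_objective(lams, X, y, kernel=linear_kernel):
    """sum_i lam_i - 1/2 sum_ij y_i y_j lam_i lam_j k(x_i, x_j)."""
    n = len(X)
    quad = sum(y[i] * y[j] * lams[i] * lams[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(lams) - 0.5 * quad

def recover_w(lams, X, y):
    """w = sum_i lam_i y_i x_i (only meaningful for the linear kernel)."""
    return [sum(l * yi * xi[d] for l, yi, xi in zip(lams, y, X))
            for d in range(len(X[0]))]

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
lams = [0.25, 0.25]              # satisfies lam_i >= 0 and sum_i lam_i y_i = 0
w = recover_w(lams, X, y)
primal = 0.5 * linear_kernel(w, w)
print(w, dual_objective(lams, X, y), primal)
```

For this toy dataset the recovered w meets both margin constraints with equality and the primal and dual objective values coincide. Swapping `linear_kernel` for another kernel function changes only the `kernel` argument, although `recover_w` then no longer applies.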

Support Vector Machines

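Primal problem (soft constraint):

$$\min_{w, \xi} \ \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i$$

$$\text{subject to} \quad y_i \langle x_i, w \rangle \ge 1 - \xi_i,\ i = 1, \dots, n, \qquad \xi_i \ge 0,\ i = 1, \dots, n$$

Lagrange dual problem for the soft constraint:

$$\max_\lambda \ \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \lambda_i \lambda_j x_i^\top x_j \tag{1}$$

$$\text{subject to} \quad 0 \le \lambda_i \le C,\ i = 1, \dots, n \tag{2}$$

$$\sum_{i=1}^{n} \lambda_i y_i = 0 \tag{3}$$

KKT conditions, for all i:

$$\lambda_i = 0 \;\rightarrow\; y_i \langle x_i, w \rangle \ge 1 \tag{4}$$

$$0 < \lambda_i < C \;\rightarrow\; y_i \langle x_i, w \rangle = 1 \tag{5}$$

$$\lambda_i = C \;\rightarrow\; y_i \langle x_i, w \rangle \le 1 \tag{6}$$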
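
Conditions (4)-(6) translate directly into a per-example check, which is the kind of test a solver can use to find multipliers that still need updating. This is a sketch under assumptions: the function name and tolerance are hypothetical, and `margin` stands for the precomputed quantity y_i⟨x_i, w⟩.

```python
def kkt_violated(lam, margin, C, tol=1e-6):
    """Return True if conditions (4)-(6) fail for one example.

    lam is the multiplier lambda_i and margin is y_i * <x_i, w>.
    """
    if lam <= tol:                 # lambda_i = 0 requires margin >= 1
        return margin < 1.0 - tol
    if lam >= C - tol:             # lambda_i = C requires margin <= 1
        return margin > 1.0 + tol
    return abs(margin - 1.0) > tol # 0 < lambda_i < C requires margin = 1

print(kkt_violated(0.0, 1.5, C=1.0))   # non-support vector outside the margin
print(kkt_violated(0.5, 1.2, C=1.0))   # free multiplier but not on the margin
```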

Sequential Minimal Optimization (Platt, 1998)

An efficient way to solve the SVM dual problem. Break a large QP problem into a series of smallest possible QP problems. Solve these small subproblems analytically.

In a nutshell

Choose two Lagrange multipliers $\lambda_i$ and $\lambda_j$.

Optimize the dual problem with respect to these two Lagrange multipliers, holding the others fixed.

There are heuristics for choosing the Lagrange multipliers that maximize the step size towards the global maximum. The first one is chosen from examples that violate the KKT conditions. The second one is chosen using an approximation based on the absolute difference in error values (see Platt (1998)).

In a nutshell

For any two Lagrange multipliers, the constraints are:

$$0 \le \lambda_i, \lambda_j \le C \tag{7}$$

$$y_i \lambda_i + y_j \lambda_j = -\sum_{k \ne i, j} y_k \lambda_k = \gamma \tag{8}$$

Express $\lambda_i$ in terms of $\lambda_j$ (since $y_i \in \{-1, +1\}$, dividing by $y_i$ is the same as multiplying by it):

$$\lambda_i = \frac{\gamma - \lambda_j y_j}{y_i}$$

Plug this back into our original function. We are then left with a very simple quadratic problem with one variable $\lambda_j$.

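Sequential Minimal Optimization (Platt, 1998)

Solve for the second Lagrange multiplier $\lambda_j$.

If $y_i \ne y_j$, the following bounds apply to $\lambda_j$:

$$L = \max(0, \lambda_j^{t-1} - \lambda_i^{t-1}) \tag{9}$$

$$H = \min(C, C + \lambda_j^{t-1} - \lambda_i^{t-1}) \tag{10}$$

If $y_i = y_j$, the following bounds apply to $\lambda_j$:

$$L = \max(0, \lambda_j^{t-1} + \lambda_i^{t-1} - C) \tag{11}$$

$$H = \min(C, \lambda_j^{t-1} + \lambda_i^{t-1}) \tag{12}$$

The solution is:

$$\lambda_j^{t} = \begin{cases} H & \text{if } \lambda_j > H \\ \lambda_j & \text{if } L \le \lambda_j \le H \\ L & \text{if } \lambda_j < L \end{cases}$$

Here $\lambda_j$ on the right-hand side denotes the unconstrained optimum of the one-variable quadratic.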
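
The bounds (9)-(12) and the clipping rule can be sketched as follows. The function names are hypothetical, and the unconstrained optimum `lam_j_unclipped` is assumed to have been computed elsewhere (the slides defer that formula to Platt (1998)). After clipping, the first multiplier is moved so that the linear constraint (8) keeps the same value γ.

```python
def smo_bounds(lam_i, lam_j, y_i, y_j, C):
    """Feasible interval [L, H] for the new lam_j, following equations (9)-(12)."""
    if y_i != y_j:
        return max(0.0, lam_j - lam_i), min(C, C + lam_j - lam_i)
    return max(0.0, lam_j + lam_i - C), min(C, lam_j + lam_i)

def clip(lam_j_unclipped, L, H):
    """Project the unconstrained one-variable optimum onto [L, H]."""
    return min(H, max(L, lam_j_unclipped))

def update_pair(lam_i, lam_j, y_i, y_j, C, lam_j_unclipped):
    """Clip lam_j, then move lam_i so that y_i*lam_i + y_j*lam_j stays equal to gamma."""
    L, H = smo_bounds(lam_i, lam_j, y_i, y_j, C)
    new_j = clip(lam_j_unclipped, L, H)
    gamma = y_i * lam_i + y_j * lam_j
    new_i = y_i * (gamma - y_j * new_j)   # 1 / y_i equals y_i for labels in {-1, +1}
    return new_i, new_j

# Different labels: the unconstrained value 1.5 is clipped to H = C = 1.0
print(update_pair(0.25, 0.5, 1, -1, 1.0, lam_j_unclipped=1.5))
# Same labels: the unconstrained value 0.0 is clipped up to L = 0.25
print(update_pair(0.75, 0.5, 1, 1, 1.0, lam_j_unclipped=0.0))
```

The interval [L, H] is exactly the set of values of the second multiplier for which both multipliers stay inside the box [0, C] while remaining on the constraint line.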

Fenchel duality

If a convex conjugate of f(x) is known, the dual function can be easily derived. The convex conjugate of a function f is:

$$f^*(y) = \max_x \ \langle y, x \rangle - f(x) \tag{13}$$

For a generic problem

$$\min_x f(x) \quad \text{subject to} \quad Ax \le b, \qquad Cx = d$$

The dual function is: $-f^*(-A^\top \lambda - C^\top \nu) - b^\top \lambda - d^\top \nu$

There are many functions whose conjugates are easy to compute:

  Exponential
  Logistic
  Quadratic form
  Log determinant
  ...
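
For a one-dimensional quadratic the conjugate has a simple closed form, which makes it a convenient sanity check for definition (13). The sketch below is an illustration under assumptions: the grid search, its range, and the function names are hypothetical, and the maximum is only approximated on the grid. For f(x) = (a/2)x² with a > 0, the maximizer of yx − f(x) is x = y/a, giving f∗(y) = y²/(2a).

```python
def conjugate_on_grid(f, y, lo=-10.0, hi=10.0, n=20001):
    """Approximate f*(y) = max_x (y * x - f(x)) by a grid search over [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max(y * (lo + k * step) - f(lo + k * step) for k in range(n))

a = 2.0
quadratic = lambda x: 0.5 * a * x * x          # f(x) = (a/2) x^2
closed_form = lambda y: y * y / (2.0 * a)      # known conjugate: y^2 / (2a)

for y_val in (-3.0, 0.0, 1.0, 3.0):
    print(y_val, conjugate_on_grid(quadratic, y_val), closed_form(y_val))
```

The grid value can never exceed the true maximum, so it approaches the closed form from below as the grid is refined.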

Parting notes

Dual formulation is useful:

It gives new insights into our problem.

It allows us to develop better optimization methods and use the kernel trick.

Thank you!

Questions?

References I

Some slides are from an upcoming EACL 2014 tutorial with Andre F. T. Martins, Noah A. Smith, and Mario F. T. Figueiredo.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In Schölkopf, B., Burges, C., and Smola, A., editors, Advances in Kernel Methods - Support Vector Learning. MIT Press.
