
Introduction to Machine Learning
Lecture 7

Mehryar Mohri
Courant Institute and Google Research

[email protected]


Convex Optimization


Differentiation

Definition: let f: X ⊆ ℝᴺ → ℝ be a differentiable function. Then the gradient of f at x ∈ X is defined by

∇f(x) = ( ∂f/∂x_1(x), …, ∂f/∂x_N(x) )ᵀ.

Definition: let f: X ⊆ ℝᴺ → ℝ be a twice differentiable function. Then the Hessian of f at x ∈ X is defined by

∇²f(x) = [ ∂²f/(∂x_i ∂x_j)(x) ]_{1 ≤ i, j ≤ N}.
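As a quick numerical illustration, the gradient and Hessian can be approximated by central finite differences and compared with the analytic formulas. The sketch below uses an arbitrary toy function f(x) = x_1² + 3 x_1 x_2; the helpers grad and hessian are generic.

```python
import numpy as np

def f(x):
    # toy example: f(x) = x1^2 + 3*x1*x2
    return x[0] ** 2 + 3 * x[0] * x[1]

def grad(f, x, h=1e-5):
    # central finite-difference approximation of the gradient
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    # finite-difference Hessian: H[i, j] ~ d^2 f / (dx_i dx_j)
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

x = np.array([1.0, 2.0])
print(grad(f, x))     # analytic gradient: (2*x1 + 3*x2, 3*x1) = (8, 3)
print(hessian(f, x))  # analytic Hessian: [[2, 3], [3, 0]]
```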


Unconstrained Optimization

Theorem (Fermat, 1629): let f: X ⊆ ℝᴺ → ℝ be a differentiable function. If f admits a local extremum at x* ∈ X, then

∇f(x*) = 0.

• x* is a stationary point.
• A local minimum is a global minimum if the function is convex.
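For a convex quadratic, Fermat's condition turns minimization into linear algebra: with f(x) = ½‖Ax − b‖², the stationarity equation ∇f(x) = AᵀAx − Aᵀb = 0 is the normal-equations system, and by convexity its solution is a global minimum. A minimal sketch on arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

# f(x) = 0.5 * ||Ax - b||^2 is convex with gradient A^T (Ax - b);
# setting the gradient to zero gives the normal equations A^T A x = A^T b.
x_star = np.linalg.solve(A.T @ A, A.T @ b)

grad_at_x_star = A.T @ (A @ x_star - b)
print(np.allclose(grad_at_x_star, 0))  # True: x* is stationary, hence a
                                       # global minimum by convexity
```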


Convexity

Definition: X ⊆ ℝᴺ is said to be convex if for any two points x, y ∈ X the segment [x, y] lies in X:

{αx + (1 − α)y : 0 ≤ α ≤ 1} ⊆ X.

Definition: let X be a convex set. A function f: X → ℝ is said to be convex if for all x, y ∈ X and α ∈ [0, 1],

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y).

f is said to be concave when −f is convex. When the inequality is strict for x ≠ y and α ∈ (0, 1), f is said to be strictly convex.
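A quick randomized sanity check of the defining inequality, here for the exponential function, which is convex on ℝ; this tests the inequality on samples and can only refute convexity, not prove it:

```python
import numpy as np

f = np.exp  # convex on R
rng = np.random.default_rng(0)

# test f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) on random (x, y, a)
ok = True
for _ in range(10_000):
    x, y = rng.uniform(-5, 5, size=2)
    a = rng.uniform(0, 1)
    ok &= f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print(bool(ok))  # True on these samples
```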


Properties of Convex Functions

Theorem: let f be a differentiable function. Then f is convex iff dom(f) is convex and

∀x, y ∈ dom(f), f(y) − f(x) ≥ ∇f(x) · (y − x).

Geometrically: the graph of f lies above each of its tangent hyperplanes, i.e., at any point (x, f(x)) the tangent y ↦ f(x) + ∇f(x) · (y − x) lower-bounds f(y).

Theorem: let f be a twice differentiable function. Then f is convex iff dom(f) is convex and its Hessian is positive semi-definite:

∀x ∈ dom(f), ∇²f(x) ⪰ 0.
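The second-order criterion is easy to test numerically: positive semi-definiteness amounts to all eigenvalues of the Hessian being nonnegative. A sketch on a toy quadratic f(x) = ½ xᵀQx, whose Hessian is Q at every point:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = M.T @ M  # Q = M^T M is positive semi-definite by construction

# f(x) = 0.5 * x^T Q x has Hessian Q everywhere, so f is convex
# iff all eigenvalues of Q are nonnegative.
eigvals = np.linalg.eigvalsh(Q)  # eigenvalues of a symmetric matrix
print(np.all(eigvals >= -1e-10))  # True: Hessian PSD, hence f convex
```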


Constrained Optimization Problem

Problem: let X ⊆ ℝᴺ and f, g_i: X → ℝ, i ∈ [1, m]. A constrained optimization problem has the form:

min_{x ∈ X} f(x)
subject to: g_i(x) ≤ 0, i ∈ [1, m].

• No convexity assumption.
• Can be augmented with equality constraints.
• Referred to as the primal problem.
• Optimal value: p*.
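As a sketch of such a program in code, take the toy instance min (x_1 − 1)² + (x_2 − 2)² subject to g(x) = x_1 + x_2 − 2 ≤ 0. Note that scipy encodes inequality constraints with the opposite convention c(x) ≥ 0, hence the sign flip below:

```python
import numpy as np
from scipy.optimize import minimize

def f(x):  # objective
    return (x[0] - 1) ** 2 + (x[1] - 2) ** 2

def g(x):  # constraint in the slides' convention: g(x) <= 0
    return x[0] + x[1] - 2

# scipy's 'ineq' constraints require c(x) >= 0, so we pass -g
res = minimize(f, x0=np.zeros(2), method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: -g(x)}])
print(res.x, res.fun)  # approx. [0.5, 1.5], optimal value p* = 0.5
```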


Lagrangian/Lagrange Function

Definition: the Lagrange function or Lagrangian associated to the constrained problem is the function defined by:

∀x ∈ X, ∀α ≥ 0, L(x, α) = f(x) + ∑_{i=1}^{m} α_i g_i(x).

The α_i's are called Lagrange or dual variables.
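In code, the Lagrangian is simply a function of the pair (x, α); a minimal sketch for the toy instance above (m = 1):

```python
import numpy as np

def f(x):
    return (x[0] - 1) ** 2 + (x[1] - 2) ** 2

def g(x):  # single constraint, g(x) <= 0
    return x[0] + x[1] - 2

def lagrangian(x, alpha):
    # L(x, alpha) = f(x) + sum_i alpha_i * g_i(x); here m = 1
    return f(x) + alpha * g(x)

print(lagrangian(np.array([0.5, 1.5]), alpha=1.0))  # 0.5 + 1.0 * 0 = 0.5
```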


Lagrange Dual Function

Definition: the (Lagrange) dual function F associated to the constrained optimization problem is defined by

∀α ≥ 0, F(α) = inf_{x ∈ X} L(x, α) = inf_{x ∈ X} [ f(x) + ∑_{i=1}^{m} α_i g_i(x) ].

• F is always concave: the Lagrangian is linear with respect to α, and taking the infimum preserves concavity.
• ∀α ≥ 0, F(α) ≤ p*: for a feasible x, f(x) + ∑_{i=1}^{m} α_i g_i(x) ≤ f(x).
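A worked toy instance: for min x² subject to x + 1 ≤ 0 (so p* = 1, attained at x = −1), the Lagrangian x² + α(x + 1) is minimized over x at x = −α/2, giving the concave dual F(α) = α − α²/4. The sketch below approximates the infimum on a grid, compares it with the closed form, and confirms F(α) ≤ p*:

```python
import numpy as np

# primal: min x^2 subject to g(x) = x + 1 <= 0; p* = 1 at x = -1
xs = np.linspace(-10, 10, 200_001)

def F(alpha):
    # dual function: F(alpha) = inf_x [ x^2 + alpha * (x + 1) ]
    return np.min(xs ** 2 + alpha * (xs + 1))

for alpha in [0.0, 1.0, 2.0, 4.0]:
    closed_form = alpha - alpha ** 2 / 4  # infimum attained at x = -alpha/2
    print(alpha, F(alpha), closed_form)   # F(alpha) <= p* = 1 in every case
```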


Dual Optimization Problem

Definition: the dual (optimization) problem associated to the constrained optimization problem is

max_α F(α)
subject to: α ≥ 0.

• Always a convex optimization problem (maximization of a concave function over a convex set).
• Optimal value: d*.
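Continuing the toy instance: maximizing F(α) = α − α²/4 over α ≥ 0 gives α* = 2 and d* = 1, equal to p* (the problem is convex and strictly feasible, e.g. at x = −2, so this is expected). A sketch using scipy, which minimizes, hence the negation:

```python
import numpy as np
from scipy.optimize import minimize

def neg_F(alpha):
    # dual of min x^2 s.t. x + 1 <= 0, in closed form: F(a) = a - a^2/4
    return -(alpha[0] - alpha[0] ** 2 / 4)

# maximize F(alpha) subject to alpha >= 0 (scipy minimizes, so negate F)
res = minimize(neg_F, x0=np.array([0.5]), bounds=[(0, None)])
print(res.x, -res.fun)  # alpha* = 2, d* = 1 = p*
```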


Weak and Strong Duality

Weak duality: d* ≤ p*.

• Always holds (clear from the previous observations).

Strong duality: d* = p*.

• Does not hold in general.
• Holds for convex problems with constraint qualifications.


Constraint Qualification

Definition: assume that int(X) ≠ ∅. Then, the following is the strong constraint qualification or Slater's condition:

∃x ∈ int(X): g(x) < 0.

Definition: assume that int(X) ≠ ∅. Then, the following is the weak constraint qualification or weak Slater's condition:

∃x ∈ int(X): ∀i ∈ [1, m], ( g_i(x) < 0 ) ∨ ( g_i(x) = 0 ∧ g_i affine ).


Saddle Point - Sufficient Condition

Theorem (Lagrange, 1797): let P be a constrained optimization problem over X = ℝᴺ. If (x*, α*) is a saddle point, that is,

∀x ∈ ℝᴺ, ∀α ≥ 0, L(x*, α) ≤ L(x*, α*) ≤ L(x, α*),

then it is a solution of P.

Proof: By the first inequality,

∀α ≥ 0, L(x*, α) ≤ L(x*, α*) ⇒ ∀α ≥ 0, α · g(x*) ≤ α* · g(x*)
⇒ g(x*) ≤ 0 ∧ α* · g(x*) = 0. (use α → +∞, then α → 0)

• In view of that, the second inequality gives

∀x, L(x*, α*) ≤ L(x, α*) ⇒ ∀x, f(x*) ≤ f(x) + α* · g(x).

Thus, for all x such that g(x) ≤ 0, f(x*) ≤ f(x).
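For the same toy instance (min x² subject to x + 1 ≤ 0), the pair (x*, α*) = (−1, 2) satisfies the saddle-point inequalities, which by the theorem certifies x* = −1 as a solution; a grid-based numerical check:

```python
import numpy as np

def L(x, alpha):
    # Lagrangian of min x^2 subject to x + 1 <= 0
    return x ** 2 + alpha * (x + 1)

x_star, a_star = -1.0, 2.0
xs = np.linspace(-5, 5, 10_001)
alphas = np.linspace(0, 10, 10_001)

# saddle point: L(x*, alpha) <= L(x*, alpha*) <= L(x, alpha*)
print(np.all(L(x_star, alphas) <= L(x_star, a_star) + 1e-12))  # True
print(np.all(L(x_star, a_star) <= L(xs, a_star) + 1e-12))      # True
```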


Saddle Point - Necessary Conditions

Theorem: assume that f and the g_i, i ∈ [1, m], are convex functions and that Slater's condition holds. If x is a solution of the constrained optimization problem, then there exists α ≥ 0 such that (x, α) is a saddle point of the Lagrangian.

Theorem: assume that f and the g_i, i ∈ [1, m], are convex differentiable functions and that the weak Slater's condition holds. If x is a solution of the constrained optimization problem, then there exists α ≥ 0 such that (x, α) is a saddle point of the Lagrangian.


Kuhn-Tucker’s Theorem

Theorem (Karush, 1939; Kuhn-Tucker, 1951): assume that f, g_i: X → ℝ, i ∈ [1, m], are convex and differentiable and that the constraints are qualified. Then x̄ is a solution of the constrained program iff there exists ᾱ ≥ 0 such that:

∇_x L(x̄, ᾱ) = ∇f(x̄) + ᾱ · ∇g(x̄) = 0
∇_α L(x̄, ᾱ) = g(x̄) ≤ 0
ᾱ · g(x̄) = ∑_{i=1}^{m} ᾱ_i g_i(x̄) = 0. (KKT conditions)

Note: the last two conditions are equivalent to

( g(x̄) ≤ 0 ) ∧ ( ∀i ∈ [1, m], ᾱ_i g_i(x̄) = 0 ). (complementarity conditions)


• Since the constraints are qualified, if x̄ is a solution, then there exists ᾱ ≥ 0 such that (x̄, ᾱ) is a saddle point. In that case, the three conditions are verified (for the third condition, see the proof on the sufficient-condition slide).

• Conversely, assume that the conditions are verified. Then, for any x such that g(x) ≤ 0,

f(x) − f(x̄) ≥ ∇f(x̄) · (x − x̄) (convexity of f)
= −∑_{i=1}^{m} ᾱ_i ∇g_i(x̄) · (x − x̄) (first condition)
≥ −∑_{i=1}^{m} ᾱ_i [g_i(x) − g_i(x̄)] (convexity of the g_i's)
= −∑_{i=1}^{m} ᾱ_i g_i(x) ≥ 0, (third condition; then ᾱ ≥ 0 and g(x) ≤ 0)

so f(x̄) ≤ f(x): x̄ is a solution.
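The KKT conditions can also be verified mechanically on the toy instance, with f(x) = x² and g(x) = x + 1, at the candidate pair (x̄, ᾱ) = (−1, 2):

```python
# KKT check for min x^2 subject to g(x) = x + 1 <= 0, at (-1, 2)
x_bar, a_bar = -1.0, 2.0

grad_f = 2 * x_bar   # f'(x) = 2x
grad_g = 1.0         # g'(x) = 1
g_val = x_bar + 1    # g(x)

print(grad_f + a_bar * grad_g == 0)  # stationarity: -2 + 2*1 = 0
print(g_val <= 0)                    # feasibility
print(a_bar * g_val == 0)            # complementary slackness
```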


References

• Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.