
The Foundations of (Machine) Learning

Stefan Kühn

codecentric AG

CSMLS Meetup Hamburg - February 26th, 2015


Contents

1 Supervised Machine Learning

2 Optimization Theory

3 Concrete Optimization Methods


1 Supervised Machine Learning


Setting

Supervised Learning Approach
Use labeled training data in order to fit a given model to the data, i.e. to learn from the given data.

Typical Problems:
- Classification (discrete output)
  - Logistic Regression
  - Neural Networks
  - Support Vector Machines
- Regression (continuous output)
  - Linear Regression
  - Support Vector Regression
  - Generalized Linear/Additive Models


Training and Learning

Ingredients:
- Training Data Set
- Model, e.g. Logistic Regression
- Error Measure, e.g. Mean Squared Error

Learning Procedure:
- Derive objective function from Model and Error Measure
- Initialize Model parameters
- Find a good fit!
- Iterate with other initial parameters

What is Learning in this context?
Learning is nothing but the application of an algorithm for unconstrained optimization to the given objective function.
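To make this concrete, here is a minimal sketch (not from the slides) of how a model and an error measure combine into an objective function that a generic unconstrained optimizer can minimize; the toy data, the helper names and the use of scipy.optimize.minimize are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative toy data: X are features, y are binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100) > 0).astype(float)

def model(w, X):
    """Model: logistic regression, predicts probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def objective(w):
    """Error measure: mean squared error between predictions and labels."""
    return np.mean((model(w, X) - y) ** 2)

# "Learning": apply an unconstrained optimizer to the objective function,
# starting from some initial parameters.
w0 = np.zeros(X.shape[1])
result = minimize(objective, w0)   # quasi-Newton (BFGS) with numerical gradients by default
print(result.x, result.fun)
```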


2 Optimization Theory


Unconstrained Optimization

Higher-order methods
- Newton's method (fast local convergence)

Gradient-based methods
- Gradient Descent / Steepest Descent (globally convergent)
- Conjugate Gradient (globally convergent)
- Gauß-Newton, Levenberg-Marquardt, Quasi-Newton
- Krylov subspace methods

Derivative-free methods, direct search
- Secant method (locally convergent)
- Regula Falsi and successors (global convergence, typically slow)
- Nelder-Mead / Downhill-Simplex
  - unconventional method, creates a moving simplex
  - driven by reflection/contraction/expansion of the corner points
  - globally convergent for differentiable functions f ∈ C¹


General Iterative Algorithmic Scheme

Goal: Minimize a given function f:

min f(x), x ∈ ℝⁿ

Iterative Algorithms
Starting from a given point, an iterative algorithm tries to minimize the objective function step by step.

Preparation: k = 0
- Initialization: Choose initial points and parameters

Iterate until convergence: k = 1, 2, 3, ...
- Termination criterion: Check optimality of the current iterate
- Descent Direction: Find reasonable search direction
- Stepsize: Determine length of the step in the given direction
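As an illustration (not from the slides), this scheme can be written as a short skeleton; the function names are hypothetical placeholders, with steepest descent and a fixed stepsize standing in for the direction and stepsize rules discussed below.

```python
import numpy as np

def iterative_minimize(f, grad, x0, tol=1e-6, maxiter=1000, stepsize=1e-2):
    """Generic scheme: check termination, pick a descent direction, take a step."""
    x = x0.astype(float)
    for k in range(maxiter):
        g = grad(x)
        # Termination criterion: gradient close to zero
        if np.linalg.norm(g) < tol:
            break
        # Descent direction: here the negative gradient (steepest descent)
        d = -g
        # Stepsize: here a fixed stepsize; a line search could be used instead
        x = x + stepsize * d
    return x

# Usage on a simple quadratic f(x) = ||x||^2
x_min = iterative_minimize(lambda x: x @ x, lambda x: 2 * x, np.array([3.0, -4.0]))
print(x_min)  # approaches [0, 0]
```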


Termination criteria

Critical points x∗:

∇f(x∗) = 0

- Gradient: should converge to zero
  ‖∇f(x_k)‖ < tol
- Iterates: the distance between x_k and x_{k+1} should converge to zero
  ‖x_k − x_{k+1}‖ < tol
- Function Values: the difference between f(x_k) and f(x_{k+1}) should converge to zero
  |f(x_k) − f(x_{k+1})| < tol
- Number of iterations: terminate after maxiter iterations
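These checks can be combined in one helper (an illustrative sketch, not from the slides; the function name and default tolerances are assumptions):

```python
import numpy as np

def converged(f, grad, x_old, x_new, k, tol=1e-8, maxiter=1000):
    """Return True if any of the usual termination criteria is met."""
    return (
        np.linalg.norm(grad(x_new)) < tol            # gradient close to zero
        or np.linalg.norm(x_old - x_new) < tol       # iterates barely moving
        or abs(f(x_old) - f(x_new)) < tol            # function values stagnating
        or k >= maxiter                              # iteration budget exhausted
    )
```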


Descent direction

Geometric interpretation
d is a descent direction if and only if the angle α between the gradient ∇f(x) and d is in a certain range:

π/2 = 90° < α < 270° = 3π/2

Algebraic equivalent
The sign of the scalar product between two vectors a and b is determined by the cosine of the angle α between a and b:

⟨a, b⟩ = aᵀb = ‖a‖ ‖b‖ cos α(a, b)

d is a descent direction if and only if:

dᵀ∇f(x) < 0
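The algebraic criterion is easy to check numerically; a tiny sketch (illustrative, not from the slides):

```python
import numpy as np

def is_descent_direction(grad_x, d):
    """d is a descent direction at x iff its scalar product with the gradient is negative."""
    return float(np.dot(d, grad_x)) < 0.0

# Example: for f(x) = ||x||^2 at x = (1, 1) the gradient is (2, 2);
# the negative gradient is a descent direction, the gradient itself is not.
g = np.array([2.0, 2.0])
print(is_descent_direction(g, -g))  # True
print(is_descent_direction(g, g))   # False
```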


Stepsize

Armijo's rule
Takes two parameters 0 < σ < 1 and 0 < ρ < 0.5.

For ℓ = 0, 1, 2, ... test the Armijo condition:

f(p + σ^ℓ d) < f(p) + ρ σ^ℓ dᵀ∇f(p)

Accepted stepsize
The first ℓ that passes this test determines the accepted stepsize

t = σ^ℓ

Standard Armijo implies that the accepted stepsize always satisfies t ≤ 1; it is therefore only semi-efficient.

Technical detail: Widening
Test whether some t > 1 satisfies the Armijo condition, i.e. check ℓ = −1, −2, ... as well; this ensures efficiency.
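A minimal backtracking implementation of Armijo's rule (an illustrative sketch, not the slides' code; the default σ and ρ and the omission of the widening step are assumptions):

```python
import numpy as np

def armijo_stepsize(f, grad_p, p, d, sigma=0.5, rho=1e-2, max_trials=50):
    """Backtracking line search with Armijo's rule (sketch, 0 < sigma < 1, 0 < rho < 0.5).

    Tests t = sigma**l for l = 0, 1, 2, ... and returns the first t with
    f(p + t*d) < f(p) + rho * t * d^T grad_p.
    """
    fp = f(p)
    slope = rho * np.dot(d, grad_p)   # negative for a descent direction
    for l in range(max_trials):
        t = sigma ** l
        if f(p + t * d) < fp + t * slope:
            return t
    return sigma ** max_trials        # fallback: very small step

# Example on f(x) = ||x||^2 at p = (1, 1) with d = -grad:
f = lambda x: x @ x
p = np.array([1.0, 1.0])
g = 2 * p
print(armijo_stepsize(f, g, p, -g))   # returns 0.5 here
```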


3 Concrete Optimization Methods


Gradient Descent

Descent direction
Direction of Steepest Descent, the negative gradient:

d = −∇f(x)

Motivation:
- corresponds to α = 180° = π
- obvious choice, always a descent direction, no test needed
- guarantees the quickest win locally
- works with inexact line search, e.g. Armijo's rule
- works for functions f ∈ C¹
- always solves the auxiliary optimization problem

  min sᵀ∇f(x), s ∈ ℝⁿ, ‖s‖ = 1
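Putting direction, stepsize and termination together, gradient descent with an Armijo backtracking line search might look like this (an illustrative sketch, not the slides' code; names and tolerances are assumptions):

```python
import numpy as np

def gradient_descent(f, grad, x0, sigma=0.5, rho=1e-2, tol=1e-6, maxiter=10_000):
    """Steepest descent with Armijo backtracking (sketch)."""
    x = x0.astype(float)
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # termination: gradient near zero
            break
        d = -g                               # steepest descent direction
        t, fx = 1.0, f(x)
        while f(x + t * d) >= fx + rho * t * np.dot(d, g):
            t *= sigma                       # backtrack until Armijo holds
        x = x + t * d
    return x

# Example: minimize the simple quadratic f(x) = (x1 - 1)^2 + 10*(x2 + 2)^2
f = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
print(gradient_descent(f, grad, np.array([0.0, 0.0])))  # close to [1, -2]
```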


Conjugate Gradient

Motivation: Quadratic Model Problem, minimize

f(x) = ‖Ax − b‖²

Optimality condition:

∇f(x∗) = 2Aᵀ(Ax∗ − b) = 0

Obvious approach: Solve the system of linear equations

AᵀAx = Aᵀb

Descent direction
Consecutive directions d_i, ..., d_{i+k} satisfy certain orthogonality or conjugacy conditions, with M = AᵀA symmetric positive definite:

d_iᵀ M d_j = 0, i ≠ j
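For this quadratic model problem, the classical linear conjugate gradient iteration can be sketched as follows (illustrative, not the slides' code; it solves AᵀAx = Aᵀb via the standard CG recurrences):

```python
import numpy as np

def cg_least_squares(A, b, tol=1e-10, maxiter=None):
    """Linear CG on the normal equations M x = c with M = A^T A, c = A^T b (sketch)."""
    M, c = A.T @ A, A.T @ b
    n = M.shape[0]
    maxiter = maxiter or 10 * n
    x = np.zeros(n)
    r = c - M @ x          # residual, equal to -1/2 of the gradient of ||Ax - b||^2
    d = r.copy()           # first search direction
    rs_old = r @ r
    for _ in range(maxiter):
        Md = M @ d
        alpha = rs_old / (d @ Md)      # exact stepsize for the quadratic
        x += alpha * d
        r -= alpha * Md
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs_old) * d  # M-conjugate update of the direction
        rs_old = rs_new
    return x

# Example: overdetermined least-squares problem
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])
print(cg_least_squares(A, b))          # matches np.linalg.lstsq(A, b, rcond=None)[0]
```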


Nonlinear Conjugate Gradient

Initial Steps:
- start at point x_0 with d_0 = −∇f(x_0)
- perform exact line search, find

  t_0 = argmin_{t > 0} f(x_0 + t d_0)

- set x_1 = x_0 + t_0 d_0

Iteration:
- set Δ_k = −∇f(x_k)
- compute β_k via one of the available formulas (next slide)
- update the conjugate search direction d_k = Δ_k + β_k d_{k−1}
- perform exact line search, find

  t_k = argmin_{t > 0} f(x_k + t d_k)

- set x_{k+1} = x_k + t_k d_k
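An illustrative sketch of this iteration (not the slides' code): the exact line search is approximated by a bounded scalar minimization, and β_k uses the Polak-Ribière formula with the automatic reset from the next slide; the bound t_max and all names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad, x0, tol=1e-6, maxiter=2000, t_max=10.0):
    """Nonlinear CG with Polak-Ribiere beta and automatic direction reset (sketch)."""
    x = x0.astype(float)
    delta = -grad(x)                     # Delta_0 = -grad f(x_0)
    d = delta.copy()                     # d_0 = Delta_0
    for _ in range(maxiter):
        if np.linalg.norm(delta) < tol:
            break
        # "Exact" line search approximated by a bounded scalar minimization
        phi = lambda t: f(x + t * d)
        t = minimize_scalar(phi, bounds=(0.0, t_max), method="bounded").x
        x = x + t * d
        delta_new = -grad(x)
        # Polak-Ribiere with automatic reset: beta = max(0, beta_PR)
        beta = max(0.0, delta_new @ (delta_new - delta) / (delta @ delta))
        d = delta_new + beta * d
        delta = delta_new
    return x

# Example: minimize the Rosenbrock function
f = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
grad = lambda x: np.array([
    -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
    200 * (x[1] - x[0] ** 2),
])
print(nonlinear_cg(f, grad, np.array([-1.2, 1.0])))  # approaches [1, 1]
```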


Nonlinear Conjugate Gradient

Formulas for β_k (with Δ_k = −∇f(x_k) and s_{k−1} the previous search direction):

Fletcher-Reeves:    β_k^FR = (Δ_kᵀ Δ_k) / (Δ_{k−1}ᵀ Δ_{k−1})

Polak-Ribière:      β_k^PR = (Δ_kᵀ (Δ_k − Δ_{k−1})) / (Δ_{k−1}ᵀ Δ_{k−1})

Hestenes-Stiefel:   β_k^HS = −(Δ_kᵀ (Δ_k − Δ_{k−1})) / (s_{k−1}ᵀ (Δ_k − Δ_{k−1}))

Dai-Yuan:           β_k^DY = −(Δ_kᵀ Δ_k) / (s_{k−1}ᵀ (Δ_k − Δ_{k−1}))

Reasonable choice with automatic direction reset:

β = max{0, β_k^PR}
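The four formulas translate directly into code (an illustrative sketch; delta_k = −∇f(x_k) and s_km1 is the previous search direction, as above):

```python
import numpy as np

def betas(delta_k, delta_km1, s_km1):
    """The four classical beta formulas for nonlinear CG (sketch).

    delta_k, delta_km1: negative gradients at the current and previous iterate.
    s_km1: previous search direction.
    """
    y = delta_k - delta_km1
    beta_fr = (delta_k @ delta_k) / (delta_km1 @ delta_km1)   # Fletcher-Reeves
    beta_pr = (delta_k @ y) / (delta_km1 @ delta_km1)         # Polak-Ribiere
    beta_hs = -(delta_k @ y) / (s_km1 @ y)                    # Hestenes-Stiefel
    beta_dy = -(delta_k @ delta_k) / (s_km1 @ y)              # Dai-Yuan
    return beta_fr, beta_pr, beta_hs, beta_dy

# Automatic direction reset with Polak-Ribiere:
# beta = max(0.0, beta_pr)
```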