The Foundations of (Machine) Learning
Stefan Kühn
codecentric AG
CSMLS Meetup Hamburg - February 26th, 2015
Contents
1 Supervised Machine Learning
2 Optimization Theory
3 Concrete Optimization Methods
1 Supervised Machine Learning
Setting
Supervised Learning Approach
Use labeled training data in order to fit a given model to the data, i.e. to learn from the given data.
Typical Problems:
- Classification (discrete output)
  - Logistic Regression
  - Neural Networks
  - Support Vector Machines
- Regression (continuous output)
  - Linear Regression
  - Support Vector Regression
  - Generalized Linear/Additive Models
Training and Learning
Ingredients:
- Training Data Set
- Model, e.g. Logistic Regression
- Error Measure, e.g. Mean Squared Error

Learning Procedure:
- Derive objective function from Model and Error Measure
- Initialize Model parameters
- Find a good fit!
- Iterate with other initial parameters
What is Learning in this context?
Learning is nothing but the application of an algorithm for unconstrained optimization to the given objective function.
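To make this concrete, here is a minimal sketch in Python (all names are illustrative, not from the talk) of how an objective function is derived from a model (linear regression) and an error measure (mean squared error). The optimizers discussed later only ever see the objective and its gradient.

```python
import numpy as np

def predict(theta, X):
    """Model: linear regression, predictions are X @ theta."""
    return X @ theta

def mse_objective(theta, X, y):
    """Error measure: mean squared error between predictions and labels."""
    residuals = predict(theta, X) - y
    return np.mean(residuals ** 2)

def mse_gradient(theta, X, y):
    """Gradient of the MSE objective with respect to theta."""
    return 2.0 / len(y) * (X.T @ (predict(theta, X) - y))
```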
2 Optimization Theory
Unconstrained Optimization
Higher-order methods
- Newton's method (fast local convergence)

Gradient-based methods
- Gradient Descent / Steepest Descent (globally convergent)
- Conjugate Gradient (globally convergent)
- Gauß-Newton, Levenberg-Marquardt, Quasi-Newton
- Krylov subspace methods

Derivative-free methods, direct search
- Secant method (locally convergent)
- Regula Falsi and successors (global convergence, typically slow)
- Nelder-Mead / Downhill-Simplex (see the SciPy sketch below)
  - unconventional method, creates a moving simplex
  - driven by reflection/contraction/expansion of the corner points
  - globally convergent for differentiable functions f ∈ C¹
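As a quick illustration of the derivative-free route, Nelder-Mead is available off the shelf, e.g. in SciPy; a minimal sketch on the Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a classic non-convex test problem, minimum at (1, 1).
def rosenbrock(x):
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

# Nelder-Mead needs no gradients; SciPy performs the
# reflection/contraction/expansion moves of the simplex internally.
result = minimize(rosenbrock, x0=np.array([-1.2, 1.0]), method="Nelder-Mead")
print(result.x)  # approximately [1. 1.]
```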
General Iterative Algorithmic Scheme
Goal: Minimize a given function f:

$$\min f(x), \quad x \in \mathbb{R}^n$$
Iterative Algorithms
Starting from a given point, an iterative algorithm tries to minimize the objective function step by step.
Preparation: k = 0
- Initialization: Choose initial points and parameters

Iterate until convergence: k = 1, 2, 3, ...
- Termination criterion: Check optimality of the current iterate
- Descent Direction: Find reasonable search direction
- Stepsize: Determine length of the step in the given direction
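The scheme translates almost literally into code. A sketch with an illustrative interface (the concrete choices of `direction` and `stepsize` are exactly what distinguishes the methods in Section 3):

```python
import numpy as np

def iterative_minimize(f, grad_f, x0, direction, stepsize,
                       tol=1e-6, maxiter=1000):
    """Generic iterative scheme: check optimality, choose a descent
    direction, choose a stepsize, take the step."""
    x = x0
    for k in range(maxiter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # termination criterion
            break
        d = direction(x, g)              # descent direction
        t = stepsize(f, grad_f, x, d)    # stepsize (line search)
        x = x + t * d
    return x
```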
Termination criteria
Critical points x*:

$$\nabla f(x^*) = 0$$

Gradient: Should converge to zero

$$\|\nabla f(x_k)\| < \text{tol}$$

Iterates: Distance between x_k and x_{k+1} should converge to zero

$$\|x_k - x_{k+1}\| < \text{tol}$$

Function Values: Difference between f(x_k) and f(x_{k+1}) should converge to zero

$$|f(x_k) - f(x_{k+1})| < \text{tol}$$
Number of iterations: Terminate after maxiter iterations
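In practice the criteria are combined; a minimal sketch (thresholds are illustrative defaults, not from the talk):

```python
import numpy as np

def should_stop(grad, x_old, x_new, f_old, f_new, k,
                tol=1e-6, maxiter=1000):
    """True as soon as any of the four criteria above fires."""
    return (np.linalg.norm(grad) < tol                # gradient near zero
            or np.linalg.norm(x_new - x_old) < tol    # iterates stall
            or abs(f_new - f_old) < tol               # function values stall
            or k >= maxiter)                          # iteration budget spent
```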
Descent direction

[Figure: illustration of descent directions relative to the gradient]
Descent direction

Geometric interpretation
d is a descent direction if and only if the angle α between the gradient ∇f(x) and d is in a certain range:

$$\frac{\pi}{2} = 90^\circ < \alpha < 270^\circ = \frac{3\pi}{2}$$

Algebraic equivalent
The sign of the scalar product between two vectors a and b is determined by the cosine of the angle α between a and b:

$$\langle a, b \rangle = a^T b = \|a\|\,\|b\| \cos \alpha(a, b)$$

d is a descent direction if and only if:

$$d^T \nabla f(x) < 0$$
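The algebraic test is cheap to check in code; a small sketch:

```python
import numpy as np

def is_descent_direction(d, grad):
    """Algebraic test from the slide: d.T @ grad < 0."""
    return float(d @ grad) < 0.0

grad = np.array([2.0, -1.0])
print(is_descent_direction(-grad, grad))                 # True: the negative gradient always qualifies
print(is_descent_direction(np.array([1.0, 0.0]), grad))  # False for this gradient
```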
Stepsize

[Figure: illustration of stepsize selection along the search direction]
Stepsize

Armijo's rule
Takes two parameters 0 < σ < 1 and 0 < ρ < 0.5.

For ℓ = 0, 1, 2, ... test the Armijo condition:

$$f(p + \sigma^\ell d) < f(p) + \rho\,\sigma^\ell\, d^T \nabla f(p)$$

Accepted stepsize
The first ℓ that passes this test determines the accepted stepsize:

$$t = \sigma^\ell$$
Standard Armijo implies that the accepted stepsize always satisfies t ≤ 1, which makes it only semi-efficient.

Technical detail: Widening
Testing whether some t > 1 satisfies the Armijo condition, i.e. checking ℓ = −1, −2, ... as well, ensures efficiency.
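A minimal sketch of Armijo's rule with widening (parameter defaults and the loop safeguards are illustrative choices, not from the talk):

```python
def armijo_stepsize(f, grad_f, x, d, sigma=0.5, rho=0.01):
    """Find the first l = 0, 1, 2, ... with
    f(x + sigma**l * d) < f(x) + rho * sigma**l * (d.T @ grad_f(x))."""
    fx = f(x)
    slope = d @ grad_f(x)   # negative for a descent direction
    l = 0
    while f(x + sigma**l * d) >= fx + rho * sigma**l * slope and l < 60:
        l += 1
    # Widening: if l = 0 was already accepted, also try l = -1, -2, ...,
    # i.e. stepsizes t > 1, instead of settling for t <= 1.
    while l > -30 and f(x + sigma**(l - 1) * d) < fx + rho * sigma**(l - 1) * slope:
        l -= 1
    return sigma**l
```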
3 Concrete Optimization Methods
Gradient Descent
Descent direction
Direction of Steepest Descent, the negative gradient:

$$d = -\nabla f(x)$$
Motivation:
- corresponds to α = 180° = π
- obvious choice, always a descent direction, no test needed
- guarantees the quickest win locally
- works with inexact line search, e.g. Armijo's rule
- works for functions f ∈ C¹
- always solves the auxiliary optimization problem

$$\min s^T \nabla f(x), \quad s \in \mathbb{R}^n, \; \|s\| = 1$$
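Putting the pieces together, a minimal gradient descent using the `armijo_stepsize` sketched above:

```python
import numpy as np

def gradient_descent(f, grad_f, x0, tol=1e-6, maxiter=1000):
    """Steepest descent: d = -grad f(x), stepsize via Armijo's rule."""
    x = x0
    for _ in range(maxiter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                                        # steepest descent direction
        x = x + armijo_stepsize(f, grad_f, x, d) * d
    return x
```

For instance, fitting the linear-regression objective from the beginning: `theta = gradient_descent(lambda t: mse_objective(t, X, y), lambda t: mse_gradient(t, X, y), np.zeros(X.shape[1]))`.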
Conjugate Gradient

Motivation: Quadratic Model Problem, minimize

$$f(x) = \|Ax - b\|^2$$

Optimality condition:

$$\nabla f(x^*) = 2A^T(Ax^* - b) = 0$$

Obvious approach: Solve the system of linear equations

$$A^T A\, x = A^T b$$

Descent direction
Consecutive directions d_i, ..., d_{i+k} satisfy certain orthogonality or conjugacy conditions, with M = AᵀA symmetric positive definite:

$$d_i^T M d_j = 0, \quad i \neq j$$
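For the quadratic model problem the textbook linear CG applies directly; a sketch for M x = b with M symmetric positive definite:

```python
import numpy as np

def linear_cg(M, b, x0, tol=1e-10):
    """Linear conjugate gradient for M x = b; consecutive search
    directions are M-conjugate: d_i.T @ M @ d_j = 0 for i != j."""
    x = x0
    r = b - M @ x              # residual
    d = r.copy()
    for _ in range(len(b)):    # at most n steps in exact arithmetic
        if np.linalg.norm(r) < tol:
            break
        Md = M @ d
        alpha = (r @ r) / (d @ Md)        # exact stepsize along d
        x = x + alpha * d
        r_new = r - alpha * Md
        beta = (r_new @ r_new) / (r @ r)  # keeps the next d M-conjugate
        d = r_new + beta * d
        r = r_new
    return x

# Usage for the normal equations: x = linear_cg(A.T @ A, A.T @ b, np.zeros(n))
```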
Nonlinear Conjugate Gradient

Initial Steps:
- start at point x_0 with d_0 = -∇f(x_0)
- perform exact line search, find

$$t_0 = \operatorname*{argmin}_{t > 0} f(x_0 + t d_0)$$

- set x_1 = x_0 + t_0 d_0

Iteration (k = 1, 2, ...):
- set Δ_k = -∇f(x_k)
- compute β_k via one of the available formulas (next slide)
- update conjugate search direction d_k = Δ_k + β_k d_{k-1}
- perform exact line search, find

$$t_k = \operatorname*{argmin}_{t > 0} f(x_k + t d_k)$$

- set x_{k+1} = x_k + t_k d_k
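A compact sketch of this iteration with a Fletcher-Reeves β and a numerically "exact" line search (using SciPy's `minimize_scalar` is a convenience assumption; any accurate one-dimensional minimizer works):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad_f, x0, tol=1e-6, maxiter=1000):
    """Nonlinear CG with Fletcher-Reeves beta (see the formulas below)."""
    x = x0
    delta = -grad_f(x)                  # Delta_0 = -grad f(x_0)
    d = delta.copy()
    for _ in range(maxiter):
        if np.linalg.norm(delta) < tol:
            break
        # exact line search: t_k = argmin_t f(x_k + t * d_k)
        t = minimize_scalar(lambda s: f(x + s * d)).x
        x = x + t * d
        delta_new = -grad_f(x)
        beta = (delta_new @ delta_new) / (delta @ delta)  # Fletcher-Reeves
        d = delta_new + beta * d
        delta = delta_new
    return x
```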
Nonlinear Conjugate Gradient

Formulas for β_k:

Fletcher-Reeves:
$$\beta_k^{FR} = \frac{\Delta_k^T \Delta_k}{\Delta_{k-1}^T \Delta_{k-1}}$$

Polak-Ribière:
$$\beta_k^{PR} = \frac{\Delta_k^T (\Delta_k - \Delta_{k-1})}{\Delta_{k-1}^T \Delta_{k-1}}$$

Hestenes-Stiefel:
$$\beta_k^{HS} = -\frac{\Delta_k^T (\Delta_k - \Delta_{k-1})}{s_{k-1}^T (\Delta_k - \Delta_{k-1})}$$

Dai-Yuan:
$$\beta_k^{DY} = -\frac{\Delta_k^T \Delta_k}{s_{k-1}^T (\Delta_k - \Delta_{k-1})}$$

Reasonable choice with automatic direction reset:

$$\beta = \max\{0, \beta_k^{PR}\}$$
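The formulas map one-to-one to code. A sketch, where `s_old` stands for s_{k-1} from the slide (the previous step) and `variant="PR+"` implements the reset rule:

```python
import numpy as np

def beta_update(delta_new, delta_old, s_old, variant="PR+"):
    """The beta_k formulas above; delta_* = -grad f at the iterates."""
    y = delta_new - delta_old                  # Delta_k - Delta_{k-1}
    if variant == "FR":                        # Fletcher-Reeves
        return (delta_new @ delta_new) / (delta_old @ delta_old)
    if variant == "PR":                        # Polak-Ribiere
        return (delta_new @ y) / (delta_old @ delta_old)
    if variant == "HS":                        # Hestenes-Stiefel
        return -(delta_new @ y) / (s_old @ y)
    if variant == "DY":                        # Dai-Yuan
        return -(delta_new @ delta_new) / (s_old @ y)
    # "PR+": Polak-Ribiere with automatic direction reset
    return max(0.0, (delta_new @ y) / (delta_old @ delta_old))
```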