The Foundations of (Machine) Learning
Stefan Kühn
codecentric AG
CSMLS Meetup Hamburg - February 26th, 2015
Contents
1 Supervised Machine Learning
2 Optimization Theory
3 Concrete Optimization Methods
1 Supervised Machine Learning
Setting
Supervised Learning Approach
Use labeled training data in order to fit a given model to the data, i.e. to learn from the given data.
Typical Problems:
- Classification (discrete output)
  - Logistic Regression
  - Neural Networks
  - Support Vector Machines
- Regression (continuous output)
  - Linear Regression
  - Support Vector Regression
  - Generalized Linear/Additive Models
Training and Learning
Ingredients:
- Training Data Set
- Model, e.g. Logistic Regression
- Error Measure, e.g. Mean Squared Error

Learning Procedure:
- Derive objective function from Model and Error Measure
- Initialize Model parameters
- Find a good fit!
- Iterate with other initial parameters
What is Learning in this context?
Learning is nothing but the application of an algorithm for unconstrained optimization to the given objective function.
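To make this concrete, here is a minimal sketch in Python (all names are illustrative, not from the talk) of how an objective function is derived from a model (linear regression) and an error measure (mean squared error). The optimizers discussed later only ever see the objective and its gradient.

```python
import numpy as np

def predict(theta, X):
    """Model: linear regression, predictions are X @ theta."""
    return X @ theta

def mse_objective(theta, X, y):
    """Error measure: mean squared error between predictions and labels."""
    residuals = predict(theta, X) - y
    return np.mean(residuals ** 2)

def mse_gradient(theta, X, y):
    """Gradient of the MSE objective with respect to theta."""
    return 2.0 / len(y) * (X.T @ (predict(theta, X) - y))
```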
2 Optimization Theory
Unconstrained Optimization
Higher-order methods
- Newton's method (fast local convergence)

Gradient-based methods
- Gradient Descent / Steepest Descent (globally convergent)
- Conjugate Gradient (globally convergent)
- Gauß-Newton, Levenberg-Marquardt, Quasi-Newton
- Krylov subspace methods

Derivative-free methods, direct search
- Secant method (locally convergent)
- Regula Falsi and successors (global convergence, typically slow)
- Nelder-Mead / Downhill-Simplex (see the SciPy sketch below)
  - unconventional method, creates a moving simplex
  - driven by reflection/contraction/expansion of the corner points
  - globally convergent for differentiable functions f ∈ C¹
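As a quick illustration of the derivative-free route, Nelder-Mead is available off the shelf, e.g. in SciPy; a minimal sketch on the Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a classic non-convex test problem, minimum at (1, 1).
def rosenbrock(x):
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

# Nelder-Mead needs no gradients; SciPy performs the
# reflection/contraction/expansion moves of the simplex internally.
result = minimize(rosenbrock, x0=np.array([-1.2, 1.0]), method="Nelder-Mead")
print(result.x)  # approximately [1. 1.]
```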
General Iterative Algorithmic Scheme
Goal: Minimize a given function f:

$$\min f(x), \quad x \in \mathbb{R}^n$$
Iterative Algorithms
Starting from a given point, an iterative algorithm tries to minimize the objective function step by step.
Preparation: k = 0
- Initialization: Choose initial points and parameters

Iterate until convergence: k = 1, 2, 3, ...
- Termination criterion: Check optimality of the current iterate
- Descent Direction: Find reasonable search direction
- Stepsize: Determine length of the step in the given direction
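The scheme translates almost literally into code. A sketch with an illustrative interface (the concrete choices of `direction` and `stepsize` are exactly what distinguishes the methods in Section 3):

```python
import numpy as np

def iterative_minimize(f, grad_f, x0, direction, stepsize,
                       tol=1e-6, maxiter=1000):
    """Generic iterative scheme: check optimality, choose a descent
    direction, choose a stepsize, take the step."""
    x = x0
    for k in range(maxiter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # termination criterion
            break
        d = direction(x, g)              # descent direction
        t = stepsize(f, grad_f, x, d)    # stepsize (line search)
        x = x + t * d
    return x
```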
Termination criteria
Critical points x*:

$$\nabla f(x^*) = 0$$

Gradient: Should converge to zero

$$\|\nabla f(x_k)\| < \text{tol}$$

Iterates: Distance between x_k and x_{k+1} should converge to zero

$$\|x_k - x_{k+1}\| < \text{tol}$$

Function Values: Difference between f(x_k) and f(x_{k+1}) should converge to zero

$$|f(x_k) - f(x_{k+1})| < \text{tol}$$
Number of iterations: Terminate after maxiter iterations
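In practice the criteria are combined; a minimal sketch (thresholds are illustrative defaults, not from the talk):

```python
import numpy as np

def should_stop(grad, x_old, x_new, f_old, f_new, k,
                tol=1e-6, maxiter=1000):
    """True as soon as any of the four criteria above fires."""
    return (np.linalg.norm(grad) < tol                # gradient near zero
            or np.linalg.norm(x_new - x_old) < tol    # iterates stall
            or abs(f_new - f_old) < tol               # function values stall
            or k >= maxiter)                          # iteration budget spent
```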
Descent direction

[Figure: illustration of descent directions relative to the gradient]
Descent direction

Geometric interpretation
d is a descent direction if and only if the angle α between the gradient ∇f(x) and d is in a certain range:

$$\frac{\pi}{2} = 90^\circ < \alpha < 270^\circ = \frac{3\pi}{2}$$

Algebraic equivalent
The sign of the scalar product between two vectors a and b is determined by the cosine of the angle α between a and b:

$$\langle a, b \rangle = a^T b = \|a\|\,\|b\| \cos \alpha(a, b)$$

d is a descent direction if and only if:

$$d^T \nabla f(x) < 0$$
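The algebraic test is cheap to check in code; a small sketch:

```python
import numpy as np

def is_descent_direction(d, grad):
    """Algebraic test from the slide: d.T @ grad < 0."""
    return float(d @ grad) < 0.0

grad = np.array([2.0, -1.0])
print(is_descent_direction(-grad, grad))                 # True: the negative gradient always qualifies
print(is_descent_direction(np.array([1.0, 0.0]), grad))  # False for this gradient
```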
Stepsize

[Figure: illustration of stepsize selection along the search direction]
Stepsize

Armijo's rule
Takes two parameters 0 < σ < 1 and 0 < ρ < 0.5.

For ℓ = 0, 1, 2, ... test the Armijo condition:

$$f(p + \sigma^\ell d) < f(p) + \rho\,\sigma^\ell\, d^T \nabla f(p)$$

Accepted stepsize
The first ℓ that passes this test determines the accepted stepsize:

$$t = \sigma^\ell$$
Standard Armijo implies that the accepted stepsize always satisfies t ≤ 1, which makes it only semi-efficient.

Technical detail: Widening
Testing whether some t > 1 satisfies the Armijo condition, i.e. checking ℓ = −1, −2, ... as well, ensures efficiency.
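A minimal sketch of Armijo's rule with widening (parameter defaults and the loop safeguards are illustrative choices, not from the talk):

```python
def armijo_stepsize(f, grad_f, x, d, sigma=0.5, rho=0.01):
    """Find the first l = 0, 1, 2, ... with
    f(x + sigma**l * d) < f(x) + rho * sigma**l * (d.T @ grad_f(x))."""
    fx = f(x)
    slope = d @ grad_f(x)   # negative for a descent direction
    l = 0
    while f(x + sigma**l * d) >= fx + rho * sigma**l * slope and l < 60:
        l += 1
    # Widening: if l = 0 was already accepted, also try l = -1, -2, ...,
    # i.e. stepsizes t > 1, instead of settling for t <= 1.
    while l > -30 and f(x + sigma**(l - 1) * d) < fx + rho * sigma**(l - 1) * slope:
        l -= 1
    return sigma**l
```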
3 Concrete Optimization Methods
Gradient Descent
Descent direction
Direction of Steepest Descent, the negative gradient:

$$d = -\nabla f(x)$$
Motivation:
- corresponds to α = 180° = π
- obvious choice, always a descent direction, no test needed
- guarantees the quickest win locally
- works with inexact line search, e.g. Armijo's rule
- works for functions f ∈ C¹
- always solves the auxiliary optimization problem

$$\min s^T \nabla f(x), \quad s \in \mathbb{R}^n, \; \|s\| = 1$$
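Putting the pieces together, a minimal gradient descent using the `armijo_stepsize` sketched above:

```python
import numpy as np

def gradient_descent(f, grad_f, x0, tol=1e-6, maxiter=1000):
    """Steepest descent: d = -grad f(x), stepsize via Armijo's rule."""
    x = x0
    for _ in range(maxiter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -g                                        # steepest descent direction
        x = x + armijo_stepsize(f, grad_f, x, d) * d
    return x
```

For instance, fitting the linear-regression objective from the beginning: `theta = gradient_descent(lambda t: mse_objective(t, X, y), lambda t: mse_gradient(t, X, y), np.zeros(X.shape[1]))`.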
Conjugate Gradient

Motivation: Quadratic Model Problem, minimize

$$f(x) = \|Ax - b\|^2$$

Optimality condition:

$$\nabla f(x^*) = 2A^T(Ax^* - b) = 0$$

Obvious approach: Solve the system of linear equations

$$A^T A\, x = A^T b$$

Descent direction
Consecutive directions d_i, ..., d_{i+k} satisfy certain orthogonality or conjugacy conditions, with M = AᵀA symmetric positive definite:

$$d_i^T M d_j = 0, \quad i \neq j$$
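For the quadratic model problem the textbook linear CG applies directly; a sketch for M x = b with M symmetric positive definite:

```python
import numpy as np

def linear_cg(M, b, x0, tol=1e-10):
    """Linear conjugate gradient for M x = b; consecutive search
    directions are M-conjugate: d_i.T @ M @ d_j = 0 for i != j."""
    x = x0
    r = b - M @ x              # residual
    d = r.copy()
    for _ in range(len(b)):    # at most n steps in exact arithmetic
        if np.linalg.norm(r) < tol:
            break
        Md = M @ d
        alpha = (r @ r) / (d @ Md)        # exact stepsize along d
        x = x + alpha * d
        r_new = r - alpha * Md
        beta = (r_new @ r_new) / (r @ r)  # keeps the next d M-conjugate
        d = r_new + beta * d
        r = r_new
    return x

# Usage for the normal equations: x = linear_cg(A.T @ A, A.T @ b, np.zeros(n))
```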
Nonlinear Conjugate Gradient

Initial Steps:
- start at point x_0 with d_0 = -∇f(x_0)
- perform exact line search, find

$$t_0 = \operatorname*{argmin}_{t > 0} f(x_0 + t d_0)$$

- set x_1 = x_0 + t_0 d_0

Iteration (k = 1, 2, ...):
- set Δ_k = -∇f(x_k)
- compute β_k via one of the available formulas (next slide)
- update conjugate search direction d_k = Δ_k + β_k d_{k-1}
- perform exact line search, find

$$t_k = \operatorname*{argmin}_{t > 0} f(x_k + t d_k)$$

- set x_{k+1} = x_k + t_k d_k
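A compact sketch of this iteration with a Fletcher-Reeves β and a numerically "exact" line search (using SciPy's `minimize_scalar` is a convenience assumption; any accurate one-dimensional minimizer works):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nonlinear_cg(f, grad_f, x0, tol=1e-6, maxiter=1000):
    """Nonlinear CG with Fletcher-Reeves beta (see the formulas below)."""
    x = x0
    delta = -grad_f(x)                  # Delta_0 = -grad f(x_0)
    d = delta.copy()
    for _ in range(maxiter):
        if np.linalg.norm(delta) < tol:
            break
        # exact line search: t_k = argmin_t f(x_k + t * d_k)
        t = minimize_scalar(lambda s: f(x + s * d)).x
        x = x + t * d
        delta_new = -grad_f(x)
        beta = (delta_new @ delta_new) / (delta @ delta)  # Fletcher-Reeves
        d = delta_new + beta * d
        delta = delta_new
    return x
```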
Nonlinear Conjugate Gradient

Formulas for β_k:

Fletcher-Reeves:
$$\beta_k^{FR} = \frac{\Delta_k^T \Delta_k}{\Delta_{k-1}^T \Delta_{k-1}}$$

Polak-Ribière:
$$\beta_k^{PR} = \frac{\Delta_k^T (\Delta_k - \Delta_{k-1})}{\Delta_{k-1}^T \Delta_{k-1}}$$

Hestenes-Stiefel:
$$\beta_k^{HS} = -\frac{\Delta_k^T (\Delta_k - \Delta_{k-1})}{s_{k-1}^T (\Delta_k - \Delta_{k-1})}$$

Dai-Yuan:
$$\beta_k^{DY} = -\frac{\Delta_k^T \Delta_k}{s_{k-1}^T (\Delta_k - \Delta_{k-1})}$$

Reasonable choice with automatic direction reset:

$$\beta = \max\{0, \beta_k^{PR}\}$$
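The formulas map one-to-one to code. A sketch, where `s_old` stands for s_{k-1} from the slide (the previous step) and `variant="PR+"` implements the reset rule:

```python
import numpy as np

def beta_update(delta_new, delta_old, s_old, variant="PR+"):
    """The beta_k formulas above; delta_* = -grad f at the iterates."""
    y = delta_new - delta_old                  # Delta_k - Delta_{k-1}
    if variant == "FR":                        # Fletcher-Reeves
        return (delta_new @ delta_new) / (delta_old @ delta_old)
    if variant == "PR":                        # Polak-Ribiere
        return (delta_new @ y) / (delta_old @ delta_old)
    if variant == "HS":                        # Hestenes-Stiefel
        return -(delta_new @ y) / (s_old @ y)
    if variant == "DY":                        # Dai-Yuan
        return -(delta_new @ delta_new) / (s_old @ y)
    # "PR+": Polak-Ribiere with automatic direction reset
    return max(0.0, (delta_new @ y) / (delta_old @ delta_old))
```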