Exploring Optimization in Vowpal Wabbit
-Shiladitya Sen
Vowpal Wabbit
• Online
• Open Source
• Machine Learning Library
• Has achieved record-breaking speed through the use of:
  • Parallel processing
  • Caching
  • Hashing, etc.
• A “true” library: offers a wide range of machine learning and optimization algorithms
Machine Learning Models
• Linear Regressor ( --loss_function squared )
• Logistic Regressor ( --loss_function logistic )
• SVM ( --loss_function hinge )
• Neural Networks ( --nn <arg> )
• Matrix Factorization
• Latent Dirichlet Allocation ( --lda <arg> )
• Active Learning ( --active_learning )
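As a rough illustration of what the three --loss_function choices compute per example, a minimal sketch (not VW's internal code; labels for the classification losses are taken as −1/+1, and constant factors are ignored):

import numpy as np

def squared_loss(prediction, label):      # --loss_function squared
    return (prediction - label) ** 2

def logistic_loss(prediction, label):     # --loss_function logistic, label in {-1, +1}
    return np.log1p(np.exp(-label * prediction))

def hinge_loss(prediction, label):        # --loss_function hinge, label in {-1, +1}
    return max(0.0, 1.0 - label * prediction)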
Regularization
• L1 Regularization ( --l1 <arg> )
• L2 Regularization ( --l2 <arg> )
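Schematically, these flags add the standard penalty terms to the training objective (the exact scaling is an implementation detail; λ₁ and λ₂ stand for the arguments passed to --l1 and --l2):

min_w  Σ_t ℓ(w; x_t, y_t) + λ₁ ‖w‖₁ + λ₂ ‖w‖₂²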
Optimization Algorithms
• Online Gradient Descent ( default )
• Conjugate Gradient ( --conjugate_gradient )
• L-BFGS ( --bfgs )
Optimization
The Convex Definition
Convex Sets
Definition:
A subset C of ℝⁿ is said to be a convex set if (1 − θ)x + θy ∈ C for all x, y ∈ C and θ ∈ (0, 1).
Convex Functions:
A real-valued function f : X → ℝ, where X is a convex set in ℝⁿ, is said to be a convex function if its epigraph, defined as {(x, μ) | x ∈ X, μ ≥ f(x)}, is a convex set.
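Equivalently (a standard restatement of the epigraph definition): f is convex if, for all x, y ∈ X and θ ∈ (0, 1),

f( (1 − θ)x + θy ) ≤ (1 − θ) f(x) + θ f(y)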
It can be proved from the definition of convex functions that such a function cannot have a strict local maximum in the interior of its domain, and that any local minimum it does have is also a global minimum.
In other words…
It has at most one minimum value, i.e. a local minimum is a global minimum.
Loss functions which are convex therefore help in optimization for Machine Learning.
Optimization
Algorithm I : Online Gradient Descent
What the batch implementation of Gradient Descent (GD) does
How does Batch-version of GD work?
• Expresses the total loss J as a function of a set of parameters x
• Calculates ∇J(x) as the direction of steepest ascent; so −∇J(x) is the direction of steepest descent
• Takes a calculated step α in that direction to reach a new point, with new co-ordinate values of x
• This continues until the required tolerance is achieved.

Algorithm:

x_{t+1} = x_t − α_t ∇J(x_t)
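A minimal sketch of this batch update (illustrative only; grad_J, alpha, and the stopping test are assumptions of the sketch, not VW internals):

import numpy as np

def batch_gd(grad_J, x0, alpha=0.1, tol=1e-6, max_iter=1000):
    # x_{t+1} = x_t - alpha * grad J(x_t), where grad_J returns the gradient
    # of the *total* loss over the whole dataset.
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_J(x)
        if np.linalg.norm(g) < tol:   # stop once the required tolerance is reached
            break
        x = x - alpha * g             # step in the direction of steepest descent
    return x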
What is the online implementation of GD?
How does online GD work?
1. Takes a point p_t from the dataset
2. Using the existing hypothesis, predicts its value
3. The true value is revealed
4. Calculates the error J as a function of the parameters x for point p_t
5. Evaluates ∇J(x_t)
6. Takes a step in the direction of steepest descent: −∇J(x_t)
7. Updates the parameters as: x_{t+1} = x_t − η_t ∇J(x_t)
8. Moves on to the next point p_{t+1}
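Putting the eight steps together for a linear predictor with squared loss, with a fixed step size η (a toy sketch, not VW's implementation; points is a 2-D array of feature vectors):

import numpy as np

def online_gd(points, labels, eta=0.1):
    # Online gradient descent for a linear predictor with squared loss.
    w = np.zeros(points.shape[1])
    for x, y in zip(points, labels):      # 1. take a point p_t
        p = w @ x                         # 2. predict with the current hypothesis
        err = p - y                       # 3.-4. true value revealed, error computed
        grad = err * x                    # 5. gradient of 0.5*(p - y)^2 w.r.t. w
        w -= eta * grad                   # 6.-7. step along -grad, update parameters
    return w                              # 8. loop moves on to the next point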
Looking Deeper into Online GD
• Essentially calculates error function J(x) independently for each point, as opposed to calculating J(x) as sum of all errors as in Batch implementation (Offline) GD
• To achieve accuracy, Online GD takes multiple passes through the dataset
(Continued…)
Still deeper…
• A cache file is used for multiple passes (-c)
• So that convergence is reached, the step η is reduced in each pass. In VW, this decay is controlled by:
--learning_rate ( -l )   → l   [default: 10]
--power_t                → p   [default: 0.5]
--initial_t              → i   [default: 1]
--decay_learning_rate    → d   [default: 1]

η_{e,n} = l · d^e · ( i / (i + n) )^p

where e is the current pass and n is the number of examples seen so far.
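As a sanity check of how the four parameters combine, a tiny helper evaluating the schedule above (names are mine, purely illustrative):

def vw_style_step_size(l=10.0, d=1.0, i=1.0, p=0.5, e=0, n=0):
    # eta = l * d**e * (i / (i + n))**p
    return l * (d ** e) * (i / (i + n)) ** p

With the defaults (p = 0.5, d = 1) the step shrinks roughly like 1/√n; choosing d < 1 additionally shrinks it on every extra pass.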
So why Online GD?
• It takes less space…
• And my system needs its space!
Optimization
Algorithm II: Method of Conjugate Gradients
What is wrong with Gradient Descent?
• Often takes steps in the same direction
• Convergence issues
Convergence Problems:
The need for Conjugate Gradients: Wouldn’t it be wonderful if, once we took a step to minimize the error in a direction, we never needed to step in that direction again?
This is where Conjugate Gradient comes in…
Method of Orthogonal Directions
• In the n-dimensional parameter space over which J is defined (its graph lives in n + 1 dimensions), at most n linearly independent directions exist
• Error function may have a component in at most n linearly independent (orthogonal) directions
• Intended: A step in each of these directions i.e. at most n steps to minimize the error
• However, this is not directly solvable for orthogonal directions: computing the required steps would need the error (i.e. the solution) itself
Conjugate Directions:
For search directions d_i, d_j:

Orthogonal:                        d_iᵀ d_j = 0
Conjugate (with respect to A):     d_iᵀ A d_j = 0
How do we get the conjugate directions?
• We first choose n mutually orthogonal directions: u_1, u_2, …, u_n
• We calculate d_i by subtracting out from u_i any components which are not A-orthogonal to d_1, …, d_{i−1}:

d_i = u_i − Σ_{k=1}^{i−1} ( (u_iᵀ A d_k) / (d_kᵀ A d_k) ) · d_k
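A direct transcription of this conjugation step (a sketch; U holds the chosen directions u_1..u_n as rows and A is symmetric positive-definite):

import numpy as np

def conjugate_directions(A, U):
    # Gram-Schmidt conjugation: build A-conjugate directions d_1..d_n
    # from linearly independent directions u_1..u_n (the rows of U).
    D = []
    for u in U:
        d = np.array(u, dtype=float)
        for dk in D:
            # subtract the part of u that is not A-orthogonal to dk
            d -= ((u @ A @ dk) / (dk @ A @ dk)) * dk
        D.append(d)
    return np.array(D)

Afterwards D[i] @ A @ D[j] is numerically zero for i ≠ j.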
So what is Method of Conjugate Gradients?
• If we set u_i to r_i, the gradient (residual) at the i-th step, we have the Method of Conjugate Gradients.
• The residuals r_i, r_j are linearly independent for i ≠ j.
• The step size in the direction d_i is found by an exact line search.
The Algorithm for Conjugate Gradient:
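For reference, the standard linear conjugate gradient iteration for minimizing J(x) = ½ xᵀ A x − bᵀ x (equivalently, solving A x = b with A symmetric positive-definite); a sketch, not VW's exact implementation:

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    # Minimizes J(x) = 0.5 x^T A x - b^T x for symmetric positive-definite A.
    n = len(b)
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x              # residual = -grad J(x)
    d = r.copy()               # first search direction
    for _ in range(n):         # at most n steps in exact arithmetic
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)  # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad
        beta = (r @ r) / rr
        d = r + beta * d       # next direction, A-conjugate to the previous ones
    return x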
Requirement for Preconditioning:
• Round-off errors lead to slight deviations from conjugate directions
• As a result, Conjugate Gradient is implemented iteratively
• To minimize number of iterations, preconditioning is done on the vector space
What is Pre-conditioning?
• The system is modified by multiplying by a matrix M⁻¹, where M is a symmetric, positive-definite matrix.
• This leads to a better clustering of the eigenvalues and faster convergence.
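Concretely, instead of A x = b one solves the preconditioned system

M⁻¹ A x = M⁻¹ b

which has the same solution but, when M approximates A well and is cheap to invert, a much better-clustered spectrum and therefore fewer iterations.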
Optimization
Algorithm III: L-BFGS
Why think linearly?
Newton’s Method proposes a step along a non-linear (curvature-aware) path, as opposed to the linear steps of GD and CG.
Leads to a faster convergence…
Newton’s Method:
2nd order Taylor series expansion of J(x + Δx):

J(x + Δx) ≈ J(x) + ∇J(x)ᵀ Δx + ½ Δxᵀ ∇²J(x) Δx

Minimizing with respect to Δx, we get:

Δx = −[∇²J(x)]⁻¹ ∇J(x)

In iterative form:

x_{n+1} = x_n − [∇²J(x_n)]⁻¹ ∇J(x_n)
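The same update written as code (a sketch; grad and hess are assumed callables returning ∇J and ∇²J at a point, and the linear system is solved rather than forming the inverse):

import numpy as np

def newton_step(x, grad, hess):
    # One Newton update: x_new = x - [hess(x)]^{-1} grad(x)
    return x - np.linalg.solve(hess(x), grad(x))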
What is this BFGS Algorithm?
• Complications in calculating H = [∇²J(x)]⁻¹ directly led to Quasi-Newton Methods, the most popular among which is BFGS.
• Named after Broyden, Fletcher, Goldfarb and Shanno
• Maintains an approximate matrix B (of the Hessian) and updates B upon each iteration
BFGS Algorithm:
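For reference, the textbook BFGS update of the inverse-Hessian approximation H_k ≈ [∇²J(x_k)]⁻¹, with s_k = x_{k+1} − x_k, y_k = ∇J(x_{k+1}) − ∇J(x_k) and ρ_k = 1 / (y_kᵀ s_k), is:

H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ

and the search direction at iteration k is −H_k ∇J(x_k).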
Memory is a limited asset
• In Vowpal Wabbit, the version of BFGS implemented is L-BFGS
• In L-BFGS, not all previous updates to B are kept in memory
• At a particular iteration, only the last m updates are stored and used to build the new update (see the sketch below)
• Also, the step size η at each step is calculated by an inexact line search satisfying the Wolfe conditions.
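A sketch of the textbook two-loop recursion behind this: it applies the implicit inverse-Hessian approximation to the current gradient using only the last m stored pairs (held oldest-first in s_list / y_list). Illustrative only, not VW's code; the returned direction would then be scaled by a Wolfe-condition line search.

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    # Compute the L-BFGS search direction -H_k * grad from the last m
    # stored pairs s_k = x_{k+1} - x_k and y_k = grad_{k+1} - grad_k.
    q = np.array(grad, dtype=float)
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
    # initial scaling H_0 = gamma * I (a common choice)
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # second loop: oldest pair to newest
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return -r   # descent direction; step length chosen by the line search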