Exploring Optimization in Vowpal Wabbit
-Shiladitya Sen
Vowpal Wabbit
• Online
• Open Source
• Machine Learning Library
• Has achieved record-breaking speed through the use of:
  • Parallel processing
  • Caching
  • Hashing, etc.
• A “true” library: offers a wide range of machine learning and optimization algorithms
Machine Learning Models
• Linear Regressor ( --loss_function squared )
• Logistic Regressor ( --loss_function logistic )
• SVM ( --loss_function hinge )
• Neural Networks ( --nn <arg> )
• Matrix Factorization
• Latent Dirichlet Allocation ( --lda <arg> )
• Active Learning ( --active_learning )
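As a rough illustration of what the three --loss_function choices compute per example, a minimal sketch (not VW's internal code; labels for the classification losses are taken as −1/+1, and constant factors are ignored):

import numpy as np

def squared_loss(prediction, label):      # --loss_function squared
    return (prediction - label) ** 2

def logistic_loss(prediction, label):     # --loss_function logistic, label in {-1, +1}
    return np.log1p(np.exp(-label * prediction))

def hinge_loss(prediction, label):        # --loss_function hinge, label in {-1, +1}
    return max(0.0, 1.0 - label * prediction)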
Regularization
• L1 Regularization ( --l1 <arg> )
• L2 Regularization ( --l2 <arg> )
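Schematically, these flags add the standard penalty terms to the training objective (the exact scaling is an implementation detail; λ₁ and λ₂ stand for the arguments passed to --l1 and --l2):

min_w  Σ_t ℓ(w; x_t, y_t) + λ₁ ‖w‖₁ + λ₂ ‖w‖₂²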
Optimization Algorithms
• Online Gradient Descent ( default )
• Conjugate Gradient ( --conjugate_gradient )
• L-BFGS ( --bfgs )
Optimization
The Convex Definition
Convex Sets
Definition:
A subset C of ℝⁿ is said to be a convex set if (1 − θ)x + θy ∈ C for all x, y ∈ C and θ ∈ (0, 1).
Convex Functions:
A real-valued function f : X → ℝ, where X is a convex set in ℝⁿ, is said to be a convex function if its epigraph, defined as {(x, μ) | x ∈ X, μ ≥ f(x)}, is a convex set.
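Equivalently (a standard restatement of the epigraph definition): f is convex if, for all x, y ∈ X and θ ∈ (0, 1),

f( (1 − θ)x + θy ) ≤ (1 − θ) f(x) + θ f(y)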
It can be proved from the definition of convex functions that such a function cannot have a strict local maximum in the interior of its domain, and that any local minimum it does have is also a global minimum.
In other words…
It has at most one minimum value, i.e. a local minimum is a global minimum.
Loss functions which are convex therefore help in optimization for Machine Learning.
Optimization
Algorithm I : Online Gradient Descent
What the batch implementation of Gradient Descent (GD) does
How does Batch-version of GD work?
• Expresses the total loss J as a function of a set of parameters x
• Calculates ∇J(x) as the direction of steepest ascent; so −∇J(x) is the direction of steepest descent
• Takes a calculated step α in that direction to reach a new point, with new co-ordinate values of x
• This continues until the required tolerance is achieved.

Algorithm:

x_{t+1} = x_t − α_t ∇J(x_t)
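A minimal sketch of this batch update (illustrative only; grad_J, alpha, and the stopping test are assumptions of the sketch, not VW internals):

import numpy as np

def batch_gd(grad_J, x0, alpha=0.1, tol=1e-6, max_iter=1000):
    # x_{t+1} = x_t - alpha * grad J(x_t), where grad_J returns the gradient
    # of the *total* loss over the whole dataset.
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_J(x)
        if np.linalg.norm(g) < tol:   # stop once the required tolerance is reached
            break
        x = x - alpha * g             # step in the direction of steepest descent
    return x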
What is the online implementation of GD?
How does online GD work?
1. Takes a point p_t from the dataset
2. Using the existing hypothesis, predicts its value
3. The true value is revealed
4. Calculates the error J as a function of the parameters x for point p_t
5. Evaluates ∇J(x_t)
6. Takes a step in the direction of steepest descent: −∇J(x_t)
7. Updates the parameters as: x_{t+1} = x_t − η_t ∇J(x_t)
8. Moves on to the next point p_{t+1}
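Putting the eight steps together for a linear predictor with squared loss, with a fixed step size η (a toy sketch, not VW's implementation; points is a 2-D array of feature vectors):

import numpy as np

def online_gd(points, labels, eta=0.1):
    # Online gradient descent for a linear predictor with squared loss.
    w = np.zeros(points.shape[1])
    for x, y in zip(points, labels):      # 1. take a point p_t
        p = w @ x                         # 2. predict with the current hypothesis
        err = p - y                       # 3.-4. true value revealed, error computed
        grad = err * x                    # 5. gradient of 0.5*(p - y)^2 w.r.t. w
        w -= eta * grad                   # 6.-7. step along -grad, update parameters
    return w                              # 8. loop moves on to the next point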
Looking Deeper into Online GD
• Essentially calculates error function J(x) independently for each point, as opposed to calculating J(x) as sum of all errors as in Batch implementation (Offline) GD
• To achieve accuracy, Online GD takes multiple passes through the dataset
(Continued…)
Still deeper…
• A cache file is used for multiple passes (-c)
• So that convergence is reached, the step η is reduced in each pass. In VW, this decay is controlled by:
--learning_rate ( -l )   → l   [default: 10]
--power_t                → p   [default: 0.5]
--initial_t              → i   [default: 1]
--decay_learning_rate    → d   [default: 1]

η_{e,n} = l · d^e · ( i / (i + n) )^p

where e is the current pass and n is the number of examples seen so far.
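As a sanity check of how the four parameters combine, a tiny helper evaluating the schedule above (names are mine, purely illustrative):

def vw_style_step_size(l=10.0, d=1.0, i=1.0, p=0.5, e=0, n=0):
    # eta = l * d**e * (i / (i + n))**p
    return l * (d ** e) * (i / (i + n)) ** p

With the defaults (p = 0.5, d = 1) the step shrinks roughly like 1/√n; choosing d < 1 additionally shrinks it on every extra pass.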
So why Online GD?
• It takes less space…
• And my system needs its space!
Optimization
Algorithm II: Method of Conjugate Gradients
What is wrong with Gradient Descent?
• Often takes steps in the same direction
• Convergence issues
Convergence Problems:
The need for Conjugate Gradients: Wouldn’t it be wonderful if, once we took a step to minimize the error in a direction, we never needed to step in that direction again?
This is where Conjugate Gradient comes in…
Method of Orthogonal Directions
• In the n-dimensional parameter space over which J is defined (its graph lives in n + 1 dimensions), at most n linearly independent directions exist
• Error function may have a component in at most n linearly independent (orthogonal) directions
• Intended: A step in each of these directions i.e. at most n steps to minimize the error
• However, this is not directly solvable for orthogonal directions: computing the required steps would need the error (i.e. the solution) itself
Conjugate Directions:
For search directions d_i, d_j:

Orthogonal:                        d_iᵀ d_j = 0
Conjugate (with respect to A):     d_iᵀ A d_j = 0
How do we get the conjugate directions?
• We first choose n mutually orthogonal directions: u_1, u_2, …, u_n
• We calculate d_i by subtracting out from u_i any components which are not A-orthogonal to d_1, …, d_{i−1}:

d_i = u_i − Σ_{k=1}^{i−1} ( (u_iᵀ A d_k) / (d_kᵀ A d_k) ) · d_k
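A direct transcription of this conjugation step (a sketch; U holds the chosen directions u_1..u_n as rows and A is symmetric positive-definite):

import numpy as np

def conjugate_directions(A, U):
    # Gram-Schmidt conjugation: build A-conjugate directions d_1..d_n
    # from linearly independent directions u_1..u_n (the rows of U).
    D = []
    for u in U:
        d = np.array(u, dtype=float)
        for dk in D:
            # subtract the part of u that is not A-orthogonal to dk
            d -= ((u @ A @ dk) / (dk @ A @ dk)) * dk
        D.append(d)
    return np.array(D)

Afterwards D[i] @ A @ D[j] is numerically zero for i ≠ j.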
So what is Method of Conjugate Gradients?
• If we set u_i to r_i, the gradient (residual) at the i-th step, we have the Method of Conjugate Gradients.
• The residuals r_i, r_j are linearly independent for i ≠ j.
• The step size in the direction d_i is found by an exact line search.
The Algorithm for Conjugate Gradient:
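For reference, the standard linear conjugate gradient iteration for minimizing J(x) = ½ xᵀ A x − bᵀ x (equivalently, solving A x = b with A symmetric positive-definite); a sketch, not VW's exact implementation:

import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10):
    # Minimizes J(x) = 0.5 x^T A x - b^T x for symmetric positive-definite A.
    n = len(b)
    x = np.zeros(n) if x0 is None else np.array(x0, dtype=float)
    r = b - A @ x              # residual = -grad J(x)
    d = r.copy()               # first search direction
    for _ in range(n):         # at most n steps in exact arithmetic
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ad = A @ d
        alpha = rr / (d @ Ad)  # exact line search along d
        x = x + alpha * d
        r = r - alpha * Ad
        beta = (r @ r) / rr
        d = r + beta * d       # next direction, A-conjugate to the previous ones
    return x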
Requirement for Preconditioning:
• Round-off errors lead to slight deviations from conjugate directions
• As a result, Conjugate Gradient is implemented iteratively
• To minimize number of iterations, preconditioning is done on the vector space
What is Pre-conditioning?
• The system is modified by multiplying by a matrix M⁻¹, where M is a symmetric, positive-definite matrix.
• This leads to a better clustering of the eigenvalues and faster convergence.
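Concretely, instead of A x = b one solves the preconditioned system

M⁻¹ A x = M⁻¹ b

which has the same solution but, when M approximates A well and is cheap to invert, a much better-clustered spectrum and therefore fewer iterations.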
Optimization
Algorithm III: L-BFGS
Why think linearly?
Newton’s Method proposes a step along a non-linear (curvature-aware) path, as opposed to the linear steps of GD and CG.
Leads to a faster convergence…
Newton’s Method:
2nd order Taylor series expansion of J(x + Δx):

J(x + Δx) ≈ J(x) + ∇J(x)ᵀ Δx + ½ Δxᵀ ∇²J(x) Δx

Minimizing with respect to Δx, we get:

Δx = −[∇²J(x)]⁻¹ ∇J(x)

In iterative form:

x_{n+1} = x_n − [∇²J(x_n)]⁻¹ ∇J(x_n)
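The same update written as code (a sketch; grad and hess are assumed callables returning ∇J and ∇²J at a point, and the linear system is solved rather than forming the inverse):

import numpy as np

def newton_step(x, grad, hess):
    # One Newton update: x_new = x - [hess(x)]^{-1} grad(x)
    return x - np.linalg.solve(hess(x), grad(x))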
What is this BFGS Algorithm?
• Complications in calculating H = [∇²J(x)]⁻¹ directly led to Quasi-Newton Methods, the most popular among which is BFGS.
• Named after Broyden, Fletcher, Goldfarb and Shanno
• Maintains an approximate matrix B (of the Hessian) and updates B upon each iteration
BFGS Algorithm:
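For reference, the textbook BFGS update of the inverse-Hessian approximation H_k ≈ [∇²J(x_k)]⁻¹, with s_k = x_{k+1} − x_k, y_k = ∇J(x_{k+1}) − ∇J(x_k) and ρ_k = 1 / (y_kᵀ s_k), is:

H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ

and the search direction at iteration k is −H_k ∇J(x_k).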
Memory is a limited asset
• In Vowpal Wabbit, the version of BFGS implemented is L-BFGS
• In L-BFGS, not all previous updates to B are kept in memory
• At a particular iteration, only the last m updates are stored and used to build the new update (see the sketch below)
• Also, the step size η at each step is calculated by an inexact line search satisfying the Wolfe conditions.
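A sketch of the textbook two-loop recursion behind this: it applies the implicit inverse-Hessian approximation to the current gradient using only the last m stored pairs (held oldest-first in s_list / y_list). Illustrative only, not VW's code; the returned direction would then be scaled by a Wolfe-condition line search.

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    # Compute the L-BFGS search direction -H_k * grad from the last m
    # stored pairs s_k = x_{k+1} - x_k and y_k = grad_{k+1} - grad_k.
    q = np.array(grad, dtype=float)
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
    # initial scaling H_0 = gamma * I (a common choice)
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    # second loop: oldest pair to newest
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return -r   # descent direction; step length chosen by the line search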