ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler, Presented by Kyle Franz

University of Kentucky

March 20, 2019

What is Special About This?

Most machine learning methods update a set of parameters, say x, with the goal of optimizing an objective function f(x). This usually means we look at the parameters at the i-th iteration, namely x_i = x_{i−1} + ∆x_{i−1}.

However, Zeiler considers a modified version of gradient descent (we have mentioned this in class), following the steepest-descent direction given by the negative of the gradient g_i.

This method uses ∆x_i = −η g_i, where g_i = ∂f(x_i)/∂x_i is the gradient of the objective with respect to the parameters at the i-th iteration, and η is a learning rate that dictates how large a step to take in the direction of the negative gradient. The negative gradient gives a local estimate of the direction that decreases the cost; when the gradient is estimated from samples of the data, this procedure is stochastic gradient descent (SGD).

Less Guess Work

Learning rates are usually picked by hand, which is not very mathematical. This paper tries to remove the task of finding a good learning rate.

This approach has many benefits, including:

Insensitive to hyperparameters.

A separate dynamic learning rate per dimension.

Minimal computation over gradient descent.

and many more...

Related Work

We have seen Newton's method, which uses ∆x_i = −H⁻¹ g_i, where H is the Hessian.

Momentum: this keeps track of past parameter updates with a decay, ∆x_i = ρ ∆x_{i−1} − η g_i, where ρ is a constant that controls the decay of the previous parameter updates. This is an improvement over SGD.

ADAGRAD: this uses the update rule ∆x_i = −(η / (∑_{k=1}^{i} g_k²)^{1/2}) g_i, i.e., each step is scaled by the squared gradients accumulated over all iterations.

Quasi-Newton method: ∆x_i = −(1 / (|diag(H_i)| + µ)) g_i, which uses only the diagonal of the Hessian plus a small damping constant µ. Note: this has been modified somewhat by Schaul, Zhang, and LeCun.
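
For comparison, here is a minimal sketch of the momentum and ADAGRAD updates listed above; the toy quadratic objective, the function names, and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def momentum_step(x, v, grad, lr=0.1, rho=0.9):
    """Momentum: dx_i = rho * dx_{i-1} - eta * g_i, then x_{i+1} = x_i + dx_i."""
    v = rho * v - lr * grad
    return x + v, v

def adagrad_step(x, sum_sq, grad, lr=0.1, eps=1e-8):
    """ADAGRAD: scale the step by the root of all accumulated squared gradients."""
    sum_sq = sum_sq + grad ** 2
    return x - lr * grad / (np.sqrt(sum_sq) + eps), sum_sq

# Toy quadratic f(x) = 0.5 * ||x||^2 (gradient = x), assumed for illustration.
x_m, v = np.array([3.0, -2.0]), np.zeros(2)
x_a, s = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(200):
    x_m, v = momentum_step(x_m, v, x_m)
    x_a, s = adagrad_step(x_a, s, x_a)
print(x_m, x_a)   # both head toward the minimizer [0, 0]
```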

ADADELTA and All of its Glory

Idea 1: This is similar to ADAGRAD, but instead of accumulating the sum of squared gradients over all time, we restrict the accumulation of past gradients to a window of some fixed size, say ω. This prevents the denominator of ADAGRAD from growing without bound and makes it a local estimate using recent gradients. In other words, learning will continue.

Now, you may have noticed that storing ω squared gradients is not a very efficient thing to do. Thus, ADADELTA implements this accumulation as an exponentially decaying average of the squared gradients.

If we call this average at time i E[g²]_i, we can compute it by E[g²]_i = ρ E[g²]_{i−1} + (1 − ρ) g_i², where ρ is a decay constant, comparable to the constant used in the momentum approach.
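
A minimal sketch of this exponentially decaying accumulation; the simulated gradient stream and the variable names are assumptions made purely for illustration.

```python
import numpy as np

def accumulate_sq_grad(avg_sq, grad, rho=0.95):
    """E[g^2]_i = rho * E[g^2]_{i-1} + (1 - rho) * g_i^2 (elementwise)."""
    return rho * avg_sq + (1.0 - rho) * grad ** 2

# Feed in a stream of noisy gradients (assumed, not from the paper).
rng = np.random.default_rng(0)
avg_sq = np.zeros(3)
for _ in range(500):
    g = rng.normal(size=3)              # stand-in for a computed gradient
    avg_sq = accumulate_sq_grad(avg_sq, g)
print(avg_sq)   # hovers around E[g^2] = 1 for unit-variance noise
```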

ADADELTA and All of its Glory Cont.

Recall that in ADAGRAD the denominator has a square root in it, so taking the square root of the accumulation above basically yields the root mean square (RMS). That is, RMS[g]_i = (E[g²]_i + ε)^{1/2}, where ε helps avoid division by 0, as mentioned in class.

Thus, for this method we consider ∆x_i = −(η / RMS[g]_i) g_i.
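
A minimal sketch of this intermediate rule (idea 1 on its own); note that a global learning rate η is still present, which idea 2 will remove. This form is essentially the update known elsewhere as RMSProp. The names and default values below are illustrative assumptions.

```python
import numpy as np

def idea1_step(x, avg_sq, grad, lr=0.1, rho=0.95, eps=1e-6):
    """x <- x - (eta / RMS[g]) * g, with RMS[g] = sqrt(E[g^2] + eps)."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2   # decaying average of g^2
    x = x - lr * grad / np.sqrt(avg_sq + eps)         # divide by the RMS of gradients
    return x, avg_sq
```

Dividing by RMS[g]_i rescales each coordinate of the step individually, which is what gives the separate per-dimension learning rate mentioned earlier.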

Algorithm for idea 1

Given ρ, ε, x_1 (the decay rate, constant, and initial parameter, respectively):

1. E[g²]_0 = 0, E[∆x²]_0 = 0 (initialize the accumulation variables)

2. for i = 1, 2, ... do (loop over the number of updates)

3. compute the gradient g_i

4. E[g²]_i = ρ E[g²]_{i−1} + (1 − ρ) g_i² (accumulate the squared gradient)

5. ∆x_i = −(RMS[∆x]_{i−1} / RMS[g]_i) g_i (compute the update)

6. E[∆x²]_i = ρ E[∆x²]_{i−1} + (1 − ρ) ∆x_i² (accumulate the squared updates)

7. x_{i+1} = x_i + ∆x_i (apply the update)

end for
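
Below is a minimal NumPy implementation of the algorithm as listed above, written as a sketch; the toy objective in the usage line is an assumption, not something from the slides.

```python
import numpy as np

def adadelta(grad_f, x, n_steps=1000, rho=0.95, eps=1e-6):
    """ADADELTA: per-dimension adaptive steps, no global learning rate."""
    avg_sq_g  = np.zeros_like(x)    # E[g^2]_0  = 0
    avg_sq_dx = np.zeros_like(x)    # E[dx^2]_0 = 0
    for _ in range(n_steps):
        g = grad_f(x)                                                   # step 3
        avg_sq_g = rho * avg_sq_g + (1 - rho) * g ** 2                  # step 4
        dx = -np.sqrt(avg_sq_dx + eps) / np.sqrt(avg_sq_g + eps) * g    # step 5
        avg_sq_dx = rho * avg_sq_dx + (1 - rho) * dx ** 2               # step 6
        x = x + dx                                                      # step 7
    return x

# Illustrative usage on f(x) = 0.5 * ||x||^2 (gradient = x); an assumed example.
print(adadelta(lambda x: x, np.array([3.0, -2.0])))   # moves toward [0, 0]
```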

ADADELTA and All of its Glory Cont.

Idea 2: For all of these updates to make sense, the units should match; imagine assigning some (arbitrary) units to the parameters. Namely, the update ∆x should carry the same units as the parameter x itself.

One can notice that in SGD and momentum the units of the update relate to the gradient, not to the parameter (if the cost f is unitless, the gradient has units of 1/units of x), so the units fail to match up. Even ADAGRAD fails.

However, second-order methods, namely ones that use the Hessian (for example Newton's method), do have the correct units for parameter updates:

∆x ∝ H⁻¹ g ∝ (∂f/∂x) / (∂²f/∂x²) ∝ units of x

ADADELTA and All of its Glory Cont.

Now, to make the units work out, we will add terms to ∆x_i = −(η / RMS[g]_i) g_i (*).

Since we secretly know that second-order methods get this right, namely Newton's method, we will write Newton's update in a different form. That is,

∆x = (∂f/∂x) / (∂²f/∂x²)  ⟹  1 / (∂²f/∂x²) = ∆x / (∂f/∂x)

Because the RMS of the gradients already appears in the denominator of (*), we consider a corresponding measure of ∆x for the numerator. Unfortunately ∆x_i is not known before the update is computed. Assuming the curvature is locally smooth, we can instead compute the decaying RMS over a window of size ω of the previous updates to estimate ∆x_i, which yields the ADADELTA method: ∆x_i = −(RMS[∆x]_{i−1} / RMS[g]_i) g_i, where the same ε appears in both RMS terms. This ensures there is still learning, since we never divide by 0 and the numerator is nonzero even at the initial step.

Experiment

Zeiler ran a set of experiments and compared many methods.

He used tanh as the activation function, which is sigmoid-like; namely, tanh(x) = 2/(1 + exp(−2x)) − 1 = 2·sigmoid(2x) − 1 (a quick numerical check of this identity appears after these bullets).

The network had 500 hidden units in the first layer and 300 in a second layer, followed by a softmax layer on top.

The experiment was done with mini-batches of 100 images per batch for 6 epochs through the training set (an epoch is one complete presentation of the data set to the learning machine).

ε = 1e−6, ρ = 0.95.

It reached 2.00% error, a 0.1% improvement over Schaul et al.'s method.
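
As a quick sanity check of the tanh/sigmoid identity quoted in the bullets above, here is a small sketch; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True
```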

Results of Experiment

Comments

It appears that momentum won overall. It is worth noting that momentum, SGD, and ADAGRAD are all extremely sensitive to the learning rate one selects.

ADADELTA has fast initial convergence like ADAGRAD, but also converges near the best outcome like momentum.

ADADELTA takes the best of ADAGRAD and momentum and splices them together.

Conclusion

There are a few more experiments in this paper for different scenarios, but in conclusion, ADADELTA is well suited to handle many situations and has great results.

You should try using ADADELTA the next time you run an experiment!
