ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler, Presented by Kyle Franz

University of Kentucky

March 20, 2019

What is Special About This?

Most machine learning methods update a set of parameters, say x, with the goal of optimizing an objective function f(x). This usually means we look at the parameters at the i-th iteration, namely x_i = x_{i−1} + ∆x_{i−1}.

However, Zeiler considers a modified version of gradient descent (we have mentioned this in class), following the steepest-descent direction given by the negative of the gradient g_i.

This method uses ∆x_i = −η g_i, where g_i = ∂f(x_i)/∂x_i is the gradient of the objective with respect to the parameters at the i-th iteration, and η is a learning rate that dictates how large a step to take in the direction of the negative gradient. The negative gradient gives a local estimate of the direction that decreases the cost; when the gradient is estimated from samples of the data, this procedure is stochastic gradient descent (SGD).

Less Guess Work

Learning rates are usually picked by hand, which is not very mathematical. This paper tries to remove the task of finding a good learning rate.

This approach has many benefits, including:

Insensitive to hyperparameters.

A separate dynamic learning rate per dimension.

Minimal computation over gradient descent.

and many more...

Related Work

We have seen Newton's method, which uses ∆x_i = −H⁻¹ g_i, where H is the Hessian.

Momentum: this keeps track of past parameter updates with a decay, ∆x_i = ρ ∆x_{i−1} − η g_i, where ρ is a constant that controls the decay of the previous parameter updates. This is an improvement over SGD.

ADAGRAD: this uses the update rule ∆x_i = −(η / (∑_{k=1}^{i} g_k²)^{1/2}) g_i, i.e., each step is scaled by the squared gradients accumulated over all iterations.

Quasi-Newton method: ∆x_i = −(1 / (|diag(H_i)| + µ)) g_i, which uses only the diagonal of the Hessian plus a small damping constant µ. Note: this has been modified somewhat by Schaul, Zhang, and LeCun.
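
For comparison, here is a minimal sketch of the momentum and ADAGRAD updates listed above; the toy quadratic objective, the function names, and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def momentum_step(x, v, grad, lr=0.1, rho=0.9):
    """Momentum: dx_i = rho * dx_{i-1} - eta * g_i, then x_{i+1} = x_i + dx_i."""
    v = rho * v - lr * grad
    return x + v, v

def adagrad_step(x, sum_sq, grad, lr=0.1, eps=1e-8):
    """ADAGRAD: scale the step by the root of all accumulated squared gradients."""
    sum_sq = sum_sq + grad ** 2
    return x - lr * grad / (np.sqrt(sum_sq) + eps), sum_sq

# Toy quadratic f(x) = 0.5 * ||x||^2 (gradient = x), assumed for illustration.
x_m, v = np.array([3.0, -2.0]), np.zeros(2)
x_a, s = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(200):
    x_m, v = momentum_step(x_m, v, x_m)
    x_a, s = adagrad_step(x_a, s, x_a)
print(x_m, x_a)   # both head toward the minimizer [0, 0]
```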

ADADELTA and All of its Glory

Idea 1: This is similar to ADAGRAD, but instead of accumulating the sum of squared gradients over all time, we restrict the accumulation of past gradients to a window of some fixed size, say ω. This prevents the denominator of ADAGRAD from growing without bound and makes it a local estimate using recent gradients. In other words, learning will continue.

Now, you may have noticed that storing ω squared gradients is not a very efficient thing to do. Thus, ADADELTA implements this accumulation as an exponentially decaying average of the squared gradients.

If we call this average at time i E[g²]_i, we can compute it by E[g²]_i = ρ E[g²]_{i−1} + (1 − ρ) g_i², where ρ is a decay constant, comparable to the constant used in the momentum approach.
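
A minimal sketch of this exponentially decaying accumulation; the simulated gradient stream and the variable names are assumptions made purely for illustration.

```python
import numpy as np

def accumulate_sq_grad(avg_sq, grad, rho=0.95):
    """E[g^2]_i = rho * E[g^2]_{i-1} + (1 - rho) * g_i^2 (elementwise)."""
    return rho * avg_sq + (1.0 - rho) * grad ** 2

# Feed in a stream of noisy gradients (assumed, not from the paper).
rng = np.random.default_rng(0)
avg_sq = np.zeros(3)
for _ in range(500):
    g = rng.normal(size=3)              # stand-in for a computed gradient
    avg_sq = accumulate_sq_grad(avg_sq, g)
print(avg_sq)   # hovers around E[g^2] = 1 for unit-variance noise
```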

ADADELTA and All of its Glory Cont.

Recall that in ADAGRAD the denominator has a square root in it, so taking the square root of the accumulation above basically yields the root mean square (RMS). That is, RMS[g]_i = (E[g²]_i + ε)^{1/2}, where ε helps avoid division by 0, as mentioned in class.

Thus, for this method we consider ∆x_i = −(η / RMS[g]_i) g_i.
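
A minimal sketch of this intermediate rule (idea 1 on its own); note that a global learning rate η is still present, which idea 2 will remove. This form is essentially the update known elsewhere as RMSProp. The names and default values below are illustrative assumptions.

```python
import numpy as np

def idea1_step(x, avg_sq, grad, lr=0.1, rho=0.95, eps=1e-6):
    """x <- x - (eta / RMS[g]) * g, with RMS[g] = sqrt(E[g^2] + eps)."""
    avg_sq = rho * avg_sq + (1.0 - rho) * grad ** 2   # decaying average of g^2
    x = x - lr * grad / np.sqrt(avg_sq + eps)         # divide by the RMS of gradients
    return x, avg_sq
```

Dividing by RMS[g]_i rescales each coordinate of the step individually, which is what gives the separate per-dimension learning rate mentioned earlier.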

Algorithm for idea 1

Given ρ, ε, x_1 (the decay rate, constant, and initial parameter, respectively):

1. E[g²]_0 = 0, E[∆x²]_0 = 0 (initialize the accumulation variables)

2. for i = 1, 2, ... do (loop over the number of updates)

3. compute the gradient g_i

4. E[g²]_i = ρ E[g²]_{i−1} + (1 − ρ) g_i² (accumulate the squared gradient)

5. ∆x_i = −(RMS[∆x]_{i−1} / RMS[g]_i) g_i (compute the update)

6. E[∆x²]_i = ρ E[∆x²]_{i−1} + (1 − ρ) ∆x_i² (accumulate the squared updates)

7. x_{i+1} = x_i + ∆x_i (apply the update)

end for
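
Below is a minimal NumPy implementation of the algorithm as listed above, written as a sketch; the toy objective in the usage line is an assumption, not something from the slides.

```python
import numpy as np

def adadelta(grad_f, x, n_steps=1000, rho=0.95, eps=1e-6):
    """ADADELTA: per-dimension adaptive steps, no global learning rate."""
    avg_sq_g  = np.zeros_like(x)    # E[g^2]_0  = 0
    avg_sq_dx = np.zeros_like(x)    # E[dx^2]_0 = 0
    for _ in range(n_steps):
        g = grad_f(x)                                                   # step 3
        avg_sq_g = rho * avg_sq_g + (1 - rho) * g ** 2                  # step 4
        dx = -np.sqrt(avg_sq_dx + eps) / np.sqrt(avg_sq_g + eps) * g    # step 5
        avg_sq_dx = rho * avg_sq_dx + (1 - rho) * dx ** 2               # step 6
        x = x + dx                                                      # step 7
    return x

# Illustrative usage on f(x) = 0.5 * ||x||^2 (gradient = x); an assumed example.
print(adadelta(lambda x: x, np.array([3.0, -2.0])))   # moves toward [0, 0]
```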

ADADELTA and All of its Glory Cont.

Idea 2: For all of these updates to make sense, the units should match; imagine assigning some (arbitrary) units to the parameters. Namely, the update ∆x should carry the same units as the parameter x itself.

One can notice that in SGD and momentum the units of the update relate to the gradient, not to the parameter (if the cost f is unitless, the gradient has units of 1/units of x), so the units fail to match up. Even ADAGRAD fails.

However, second-order methods, namely ones that use the Hessian (for example Newton's method), do have the correct units for parameter updates:

∆x ∝ H⁻¹ g ∝ (∂f/∂x) / (∂²f/∂x²) ∝ units of x

ADADELTA and All of its Glory Cont.

Now, to make the units work out, we will add terms to ∆x_i = −(η / RMS[g]_i) g_i (*).

Since we secretly know that second-order methods get this right, namely Newton's method, we will write Newton's update in a different form. That is,

∆x = (∂f/∂x) / (∂²f/∂x²)  ⟹  1 / (∂²f/∂x²) = ∆x / (∂f/∂x)

Because the RMS of the gradients already appears in the denominator of (*), we consider a corresponding measure of ∆x for the numerator. Unfortunately ∆x_i is not known before the update is computed. Assuming the curvature is locally smooth, we can instead compute the decaying RMS over a window of size ω of the previous updates to estimate ∆x_i, which yields the ADADELTA method: ∆x_i = −(RMS[∆x]_{i−1} / RMS[g]_i) g_i, where the same ε appears in both RMS terms. This ensures there is still learning, since we never divide by 0 and the numerator is nonzero even at the initial step.

Experiment

Zeiler ran a set of experiments and compared many methods.

He used tanh as the activation function, which is sigmoid-like; namely, tanh(x) = 2/(1 + exp(−2x)) − 1 = 2·sigmoid(2x) − 1 (a quick numerical check of this identity appears after these bullets).

The network had 500 hidden units in the first layer and 300 in a second layer, followed by a softmax layer on top.

The experiment was done with mini-batches of 100 images per batch for 6 epochs through the training set (an epoch is one complete presentation of the data set to the learning machine).

ε = 1e−6, ρ = 0.95.

It reached 2.00% error, a 0.1% improvement over Schaul et al.'s method.
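
As a quick sanity check of the tanh/sigmoid identity quoted in the bullets above, here is a small sketch; the sample points are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4.0, 4.0, 9)
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True
```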

Results of Experiment

Comments

It appears that momentum won overall. It is worth noting that momentum, SGD, and ADAGRAD are all extremely sensitive to the learning rate one selects.

ADADELTA has fast initial convergence like ADAGRAD, but also converges near the best outcome like momentum.

ADADELTA takes the best of ADAGRAD and momentum and splices them together.

Conclusion

There are a few more experiments in this paper for different scenarios, but in conclusion, ADADELTA is well suited to handle many situations and has great results.

You should try using ADADELTA the next time you run an experiment!
