Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Introduction to advanced parameter optimization
So far:
• What is a neural network?
• Basic training algorithm:
• Gradient descent
• Backpropagation
Next: advanced training algorithms
Gradient descent algorithm
1. Choose an initial weight vector and let .
2. Let .
3. Evaluate .
4. Let .
5. Let and go to step 2.
w
1
d
1
g
1
–=
w
j
1
+
w
j
η
d
j
+=
g
j
1
+
d
j
1
+
g
j
1
+
–=
j j
1
+=
Gradient descent review
Gradient descent:
where,
Two main problems:
• Slow convergence
• Trial-and-error selection of
Goal
: cut number of epochs (training cycles) by orders of magnitude ... how?
w
t
1
+
( )
w
t
( )
w
t
( )∆
+=
w
t
( )∆ η
E
w
t
( )[ ]∇
–=
η
How to improve over gradient descent?
• Must understand convergence properties
• Use second-order information...
First case study
(same as last time)
Will also look at simple nonquadratic error surfaces...
E
20
ω
12
ω
22
+=
-1.5
-1
-0.5
0
0.5
1 -0.5
0
0.5
1
1.5
2
0
20
40
-1.5
-1
-0.5
0
0.5
1
ω
1
ω
2
E
Why quadratic error surface?
Disadvantages:
• Too simple/too few parameters
• NN error surface not
globally
quadratic
Advantage:
• Easy to visualize
• NN error surfaces will be
locally
quadratic near a local minimum.
Taylor series expansion
Single dimension (from calculus):
Multi-dimensional error surface:
about some vector , where,
(
Hessian:
not just a German mercenary)
f x
( )
f x
0
( )
f
'
x
0
( )
x x
0
–
( )
f
''
x
0
( )
x x
0
–
( )
2
+ +
≈
E
w
( )
E
w
0
( )
w w
0
–
( )
T
b
12
---
w w
0
–
( )
T
H
w w
0
–
( )
+ +
≈
w
0
b
E
w
0
( )∇
=
H E
w
0
( )∇[ ]∇
=
Hessian matrix
Definition
: The
Hessian
matrix of a -dimensional function is defined as,
where,
Alter natively:
W W
×
H
W
E
w
( )
H E
w
( )∇[ ]∇
=
w
ω
1
ω
2
… ω
WT
=
H
i j
,( )
∂
2
E
∂ω
i
∂ω
j
------------------=
Some linear algebra
Definition
: For a square matrix , the
eigenvalues
are the solution of,
Definition
: A square matrix is
positive-definite
, if and only if all its eigenvalues are greater than zero. If a matrix is positive-definite, then,
, .
• Quadratic error surface:
• Arbitrary error surface: near local minimum.
W W
×
H
λ
λ
I
W
H
–
0
=
H
λ
i
v
T
H
v
0
>
v
∀
0
≠
H
0
>
H
0
>
Gradient descent convergence rate
Near local minimum:
(why?)
Convergence governed by:
Learning rate bound:
λ
min
0
>
λ
min
λ
max
------------
0
η
2
λ
max
------------
< <
Simple Hessian example
First partial derivatives:
Second partial derivatives:
E
20
ω
12
ω
22
+=
E
∂ω
1
∂
---------
40
ω
1
=
E
∂ω
2
∂
---------
2
ω
2
=
∂
2
E
∂ω
12
----------
40
=
∂
2
E
∂ω
22
----------
2
=
∂
2
E
∂ω
1
∂ω
2
--------------------
∂
2
E
∂ω
2
∂ω
1
--------------------
0
= =
Simple Hessian example (continued)
Second partial derivatives:
Hessian:
E
20
ω
12
ω
22
+=
∂
2
E
∂ω
12
----------
40
=
∂
2
E
∂ω
22
----------
2
=
∂
2
E
∂ω
1
∂ω
2
--------------------
∂
2
E
∂ω
2
∂ω
1
--------------------
0
= =
H
40 0
0 2
=
Simple Hessian example (continued)
What are the eigenvalues?
H
40 0
0 2
=
Computation of eigenvalues
λ
I
2
H
–
0
=
λ
1 0
0 1
40 0
0 2
–
0
=
λ
40
–
0
0
λ
2
–
0
=
λ
40
–
( ) λ
2
–
( )
0
=
λ
min
2
=
λ
max
40
=
Learning rate bounds
(same as fixed-point derivation)
λ
min
2
=
λ
max
40
=
0
η
2
λ
max
------------
< <
0
η
240
------
< <
0.05
=
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-0.5
0
0.5
1
1.5
2
ω
1
ω
2
η
0.01
=
719 steps
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-0.5
0
0.5
1
1.5
2
ω
2
ω
1
η
0.04
=
175 steps
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-0.5
0
0.5
1
1.5
2
ω
1
η
0.05
=
no convergence
Basic pr oblem: “long v alley with steep sides”
What characterizes a “ long valley with steep sides?”
Length of contour lines proportional to:
and
Small ratios:
So what can we do about this?
1
λ
1
----------
1
λ
2
----------
λ
min
λ
max
------------
Solution
• Fixed learning rate is the problem
• Answer: different learning rates for each weight.
Key question: how to achieve automatically?
Heuristic extension: momentum
Gradient descent with momentum:
, ,
Notes:
• dependent on and
• Ideally, high
effective
learning rate in shallow dimensions
• Little effect along steep dimensions
µ
w
t
1
+
( )
w
t
( )
w
t
( )∆
+=
w
0
( )∆ η
E
w
0
( )[ ]∇
–=
w
t
( )∆ η
E
w
t
( )[ ]∇
–
µ∆
w
t
1
–
( )
+=
t
0
>
0
µ≤
1
<
w
t
( )∆
w
t
( )
w
t
1
–
( )
Gradient descent algorithm
1. Choose an initial weight vector and let .
2. Let .
3. Evaluate .
4. Let .
5. Let and go to step 2.
w
1
d
1
g
1
–=
w
j
1
+
w
j
η
d
j
+=
g
j
1
+
d
j
1
+
g
j
1
+
–=
j j
1
+=
Analyzing momentum term
Shallow regions: assume,
,
Then:
E
w
t
( )∇
E
w
0
( )∇≈
g
=
t
1 2
…, ,{ }∈
w
0
( )∆ η
g
–=
w
1
( )∆ η
g
–
µ
w
0
( )∆
+
≈ η
g
1
µ
+
( )
–=
w
2
( )∆ η
g
–
µ
w
1
( )∆
+
≈ η
g
–
µ η
g
1
µ
+
( )
–
[ ]
+=
w
2
( )∆ η
g
1
µ µ
2
+ +
( )
–
≈
Analyzing momentum term
Assumption (shallow region):
,
In general,
In the limit:
E
w
t
( )∇
E
w
0
( )∇≈
g
=
t
1 2
…, ,{ }∈
w
t
( )∆ η
g
µ
s
s
0
=
t
∑
–
≈ η
1
µ
t
1
+
–
1
µ
–---------------------
g
–=
w
t
( )∆ η
–
1
µ
–
( )
-----------------
g
≈
t
∞→
lim
Analyzing momentum term
Effective learning rate (shallow regions):
η
1
µ
–
( )⁄
Analyzing momentum term
Steep regions: oscillations
Net effect (ideally): little
E
w
t
1
+
( )[ ]∇
E
w
t
( )[ ]∇
–
≈
Momentum
Advantage:
• Increase
effective
learning rate in shallow regions
Disadvantages:
• Yet another parameter to hand tune
• If not carefully chosen, can do more harm than good
µ
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-0.5
0
0.5
1
1.5
2
ω
1
ω
2
η
0.01
µ,
0.0
= =
719 steps
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-0.5
0
0.5
1
1.5
2
ω
1
ω
2
η
0.01
µ,
0.5
= =
341 steps
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-0.5
0
0.5
1
1.5
2
ω
1
ω
2
η
0.01
µ,
0.9
= =
266 steps
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-0.5
0
0.5
1
1.5
2
ω
2
ω
1
η
0.04
µ,
0.0
= =
175 steps
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-0.5
0
0.5
1
1.5
2
ω
2
ω
1
η
0.04
µ,
0.5
= =
60 steps
Convergence examples
-1.5 -1 -0.5 0 0.5 1
-1
-0.5
0
0.5
1
1.5
2
ω
2
ω
1
η
0.04
µ,
0.9
= =
272 steps
Convergence examples: summary
719 341 266
175 60 272
µ
0.0
=
µ
0.5
=
µ
0.9
=
η
0.01
=
η
0.04
=
Heuristic extensions to gradient descent
Momentum popular in neural network community.
Many other heuristic attempts (some examples):
• Adaptive learning rate (what should and be?)
• (what’s the problem here?)
ρ
σ
η
new
ρη
old
E
∆
0
<
ση
old
E
∆
0
>
=
η
max
2
λ
max
⁄
=
Heuristic extensions to gradient descent
• Individual learning rates:
(problems?)
• Quickprop: local independent quadratic assumption:
(problems?)
η
i
∆ γ
g
it
( )
g
it
1
–
( )
=
ω
i
∆
t
1
+
( )
g
it
( )
g
it
1
–
( )
g
it
( )
–------------------------------
ω
i
∆
t
( )
=
Heuristic extensions to gradient descent
Problems:
• Additional hand-tuned parameters
•
Independence of weight assumptions
More principled approach is desirable.
Steepest descent
Gradient descent:
where,
Question: why take all those little tiny steps?
w
t
1
+
( )
w
t
( )
w
t
( )∆
+=
w
t
( )∆ η
E
w
t
( )[ ]∇
–=
Steepest descent: gradient descent with line minimization
1. Define search direction :
2. Minimize:
such that:
,
3. New update:
(problems?)
d
t
( )
d
t
( )
E
w
t
( )[ ]∇
–=
E
η( )
E
w
t
( ) η
d
t
( )
+
[ ]≡
E
η∗( )
E
η( )≤
η∀
w
t
1
+
( )
w
t
( ) η∗
d
t
( )
+=
Steepest descent
Question: Do we need to compute ?
Answer: No. Use
one-dimensional
line search
, which requires only evaluation of .
Line search: two steps
1. Bracket minimum
2. Line minimization
∂
E
∂η⁄
E
η( )
Line search: bracketing the minimum
Basic problem: need three values such that:
a b c
, ,
E a
( )
E b
( )>
E c
( )
E b
( )>
1 2 3 4 5 6 7 8
1
2
3
4
5
6
η
E
η()
a
bc
Bracketing the minimum
1. Let . Let .
Note: will satisfy (why?).
2. Let , where (what should be?).
3. If , then done; else, let and . Repeat step 2.
Note: one evaluation of per step.
a
0
=
b
ε
=
E a
( )
E b
( )>
c k b a
–
( )
a
+=
k
1
>
k
E c
( )
E b
( )>
a b
=
b c
=
E
Bracketing example
1 2 3 4 5 6 7 8
1
2
3
4
5
6
1 2 3 4 5 6 7 8
1
2
3
4
5
6
1 2 3 4 5 6 7 8
1
2
3
4
5
6
1 2 3 4 5 6 7 8
1
2
3
4
5
6
η
η
η
η
E
η()
E
η()
E
η()
E
η()
#1 #2
#3 #4
Bracketing example: error surface
E
ω
1
ω
2
,( )
1 5
ω
12
–
ω
22
–
( )
exp
–=
-2
-1
0
1
2 -2
-1
0
1
2
0
0.25
0.5
0.75
1
-2
-1
0
1
2
ω
2
ω
1
What is ?
Weights
At :
E
η( )
ω
1
ω
2
,( )
1 2
,( )
=
E
ω
1
ω
2
,( )∇
10
ω
1
5
ω
12
–
ω
22
–
( )
exp 2
ω
2
,[ ]
5
ω
12
–
ω
22
–
( )
exp
=
E
η( )
1 5
ω
1
10
ω
1
η
5
ω
12
–
ω
22
–
( )
exp
–
( )
2
–
–
(
exp
–=
ω
2
2
ω
2
η
5
ω
12
–
ω
22
–
( )
exp
–
( )
2
)
ω
1
ω
2
,( )
1 2
,( )
=
E
η( )
1 5 1 10
η
9
–
( )
exp
–
( )
2
–
2 4
η
9
–
( )
exp
–
( )
2
–
( )
exp
–=
Bracketing example: error surface
0 200 400 600 800 1000 1200 1400
0.92
0.94
0.96
0.98
1
η
E
η()
a
b
c
Line minimization
1. Pick a value of in larger interval: or .
2. If is larger interval, set new bracketing values to:
if , (set ), or
, if , (set and ).
Else, set new bracketing values to,
if , (set ), or
, if , (set and ).
3. Iterate steps 1 and 2 until .
η
x
=
a b
,( )
b c
,( )
a b
,( )
x b c
, ,{ }
E x
( )
E b
( )>
a x
=
a x b
, ,{ }
E b
( )
E x
( )>
c b
=
b x
=
a b x
, ,{ }
E x
( )
E b
( )>
c x
=
b x c
, ,{ }
E b
( )
E x
( )>
a b
=
b x
=
c a
–
( ) θ<
Line minimization
What should the value of be?
[ is larger interval]
[ is larger interval]
Rate of convergence proportional to:
(golden mean)
x
x
0.381966
c b
–
( )
b
+=
b c
,( )
x b
0.381966
b a
–
( )
–=
a b
,( )
1
k
---
0.61803
≈
k
1 5
+
2
----------------
1.61803
≈
=
Examples: quadratic surface
-1 -0.5 0 0.5 1
0
0.5
1
1.5
2
ω
2
ω
1
steepest descent15 steps to convergence
Examples: quadratic surface (comparison)
-1 -0.5 0 0.5 10
0.5
1
1.5
2
ω
2
ω
1
η
0.04
=
( )
gradient descent175 steps to convergence
Examples: nonquadratic surface
-1 -0.5 0 0.5 1
0
0.5
1
1.5
2
ω
2
ω
1
steepest descent24 steps to convergence
Examples: nonquadratic surface
-1 -0.5 0 0.5 1 1.5
-1.5
-1
-0.5
0
0.5
1
1.5
2
ω
2
ω
1
η
0.2
=
( )
gradient descent456 steps to convergence
Computational complexity
Steepest descent:
computations/step (why?)
Gradient descent:
computations/step
5
NW
10 2
NW
( )
+
25
NW
=
5
NW
Discussion
What’ s bad about steepest descent?
Answer: Orthogonality of consecutive steps.
• Why does this occur?
• Can we do better?