Natural Gradient Works Efficiently in Learning
S. Amari
11.03.18.(Fri)
Computational Modeling of Intelligence
Summarized by Joon Shik Kim



Abstract

• The ordinary gradient of a function does not represent its steepest direction, but the natural gradient does.

• The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient.

• The plateau phenomenon, which appears in the backpropagation learning algorithm of multilayer perceptrons, might disappear or might not be so serious when the natural gradient is used.


Introduction (1/2)

• The stochastic gradient method is a popular learning method in the general nonlinear optimization framework.

• The parameter space is not Euclidean but has a Riemannian metric structure in many cases.

• In these cases, the ordinary gradient does not give the steepest direction of the target function.


Introduction (2/2)

• Barkai, Seung, and Sompolinsky (1995) proposed an adaptive method of adjusting the learning rate. We generalize their idea and evaluate its performance based on the Riemannian metric of errors.


Natural Gradient (1/5)

• The squared length of a small incremental vector dw connecting w and w + dw is, in a Euclidean space with an orthonormal coordinate system,

$|dw|^2 = \sum_i (dw_i)^2.$

• When the coordinate system is nonorthogonal, the squared length is given by the quadratic form

$|dw|^2 = \sum_{i,j} g_{ij}(w)\,dw_i\,dw_j,$

where $G = (g_{ij})$ is the Riemannian metric tensor.
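A minimal numerical illustration (the 2-D increment dw and the metric G below are made up, not from the paper) of how the same increment has different squared lengths under the Euclidean and the Riemannian metric:

```python
import numpy as np

# Hypothetical 2-D example: the metric G below is made up for illustration.
dw = np.array([0.1, 0.2])          # small increment dw
G = np.array([[2.0, 0.5],          # Riemannian metric tensor g_ij(w),
              [0.5, 1.0]])         # symmetric and positive definite

euclidean_sq = dw @ dw             # |dw|^2 = sum_i (dw_i)^2
riemannian_sq = dw @ G @ dw        # |dw|^2 = sum_ij g_ij dw_i dw_j

print(euclidean_sq, riemannian_sq) # approximately 0.05 vs. 0.08
```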


Natural Gradient (2/5)

• The steepest descent direction of a function L(w) at w is defined by the vector dw that minimizes L(w + dw) under the constraint that |dw| has a fixed length, that is, $|dw|^2 = \varepsilon^2$ for a sufficiently small constant $\varepsilon$.


Natural Gradient (3/5)

• The steepest descent direction of L(w) in a Riemannian space is given by

$-\tilde{\nabla}L(w) = -G^{-1}(w)\,\nabla L(w),$

where $G^{-1}$ is the inverse of the metric $G = (g_{ij})$ and $\nabla L$ is the ordinary gradient; $\tilde{\nabla}L$ is called the natural gradient of L. It is obtained by minimizing the linearized loss under the constraint $|dw|^2 = \varepsilon^2$ with a Lagrange multiplier.
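A minimal sketch of this direction, assuming a quadratic loss L(w) = ½ wᵀAw and a hand-picked metric G (both made up for illustration); it only shows how G⁻¹ rescales the ordinary gradient:

```python
import numpy as np

# Hypothetical quadratic loss L(w) = 0.5 * w^T A w; A and G are made up.
A = np.array([[5.0, 0.0],
              [0.0, 1.0]])
G = np.array([[4.0, 0.0],                  # assumed Riemannian metric G(w)
              [0.0, 1.0]])

def grad_L(w):
    return A @ w                           # ordinary gradient of L at w

w = np.array([1.0, 1.0])
ordinary = -grad_L(w)                      # steepest direction in the Euclidean sense
natural = -np.linalg.solve(G, grad_L(w))   # -G^{-1} grad L(w): the natural gradient direction

print(ordinary)                            # [-5. -1.]
print(natural)                             # [-1.25 -1.  ]
```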


Natural Gradient (4/5)


Natural Gradient (5/5)


Natural Gradient Learning

• Risk function or average loss: $L(w) = E[\,l(z, w)\,]$, the expectation of the loss over the true distribution of the examples z.

• Learning is a procedure to search for the optimal w* that minimizes L(w).

• Stochastic gradient descent learning: $w_{t+1} = w_t - \eta_t \nabla l(z_t, w_t)$; natural gradient learning replaces the ordinary gradient by $\tilde{\nabla} l = G^{-1}(w_t)\,\nabla l(z_t, w_t)$.
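Below is a minimal sketch contrasting the two update rules. The toy Gaussian-mean model, the 1/t learning-rate schedule, and the helper functions grad_loss and fisher are assumptions for illustration, not taken from the paper; for this model the natural gradient update reproduces the running sample mean, which is the efficient estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy problem: estimate the mean w of Gaussian data z ~ N(w*, sigma^2 I).
w_true, sigma, dim = np.array([1.0, -2.0]), 2.0, 2

def grad_loss(z, w):
    # gradient of l(z, w) = -log p(z, w) for the Gaussian model above
    return (w - z) / sigma**2

def fisher(w):
    # Fisher information matrix G(w); constant for this model
    return np.eye(dim) / sigma**2

w_sgd = np.zeros(dim)       # ordinary stochastic gradient descent
w_nat = np.zeros(dim)       # natural gradient online learning
for t in range(1, 2001):
    z = w_true + sigma * rng.standard_normal(dim)
    eta = 1.0 / t           # 1/t learning-rate schedule
    w_sgd -= eta * grad_loss(z, w_sgd)
    w_nat -= eta * np.linalg.solve(fisher(w_nat), grad_loss(z, w_nat))

# w_nat tracks the running sample mean (the efficient estimator); w_sgd converges more slowly
print(w_sgd, w_nat, w_true)
```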


Statistical Estimation of Probability Density Function (1/2)

• In the case of statistical estimation, we assume a statistical model {p(z, w)}, and the problem is to obtain the probability distribution $\hat{p}(z) = p(z, \hat{w})$ that approximates the unknown density function q(z) in the best way.

• The loss function is the negative log-likelihood, $l(z, w) = -\log p(z, w)$.


Statistical Estimation of Probability Density Function (2/2)

• The expected loss is then given by

$L(w) = E_q[-\log p(z, w)] = D(q(z)\,\|\,p(z, w)) + H_Z,$

where $H_Z = -\int q(z)\log q(z)\,dz$ is the entropy of q(z) and does not depend on w, so minimizing the expected loss is equivalent to minimizing the Kullback-Leibler divergence.

• The Riemannian metric of the parameter space is the Fisher information,

$g_{ij}(w) = E\left[\frac{\partial \log p(z, w)}{\partial w_i}\,\frac{\partial \log p(z, w)}{\partial w_j}\right].$
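A Monte Carlo sketch of this metric. The two-parameter Gaussian model (mean mu, log-standard-deviation s) and all names below are assumptions for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed two-parameter model (not from the slides): z ~ N(mu, exp(2 s)),
# i.e. w = (mu, s) with mean mu and standard deviation exp(s).
mu, s = 0.5, 0.3
sigma = np.exp(s)

# Per-sample score vectors, i.e. the gradient of log p(z, w) with respect to w.
z = mu + sigma * rng.standard_normal(100_000)
d_mu = (z - mu) / sigma**2
d_s = ((z - mu) / sigma)**2 - 1.0
scores = np.stack([d_mu, d_s], axis=1)           # shape (N, 2)

# Monte Carlo estimate of g_ij(w) = E[ d_i log p(z, w) * d_j log p(z, w) ]
G_mc = scores.T @ scores / len(z)
print(G_mc)                                      # close to the exact value below
print(np.diag([1.0 / sigma**2, 2.0]))            # exact Fisher information for this model
```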


Fisher Information as the Metric of Kullback-Leibler Divergence (1/2)

• Set p = q(θ + h), where q(θ) denotes the density q(z, θ). Then

$D(q\,\|\,p) = \int q \ln\frac{q}{p}\,dz = -\int q \ln\frac{p}{q}\,dz$

$\approx -\int q\left[\left(\frac{p}{q} - 1\right) - \frac{1}{2}\left(\frac{p}{q} - 1\right)^2\right]dz = \frac{1}{2}\int \frac{1}{q}\,(p - q)(p - q)\,dz,$

since $\int (p - q)\,dz = 0$.

• Dividing by $h^2$ and letting $h \to 0$,

$\lim_{h\to 0}\frac{D(q(\theta)\,\|\,q(\theta + h))}{h^2} = \lim_{h\to 0}\frac{1}{2}\int \frac{1}{q}\,\frac{q(\theta + h) - q(\theta)}{h}\cdot\frac{q(\theta + h) - q(\theta)}{h}\,dz.$


Fisher Information as the Metric of Kullback-Leibler Divergence (2/2)

$\lim_{h\to 0}\frac{D(q(\theta)\,\|\,q(\theta + h))}{h^2} = \lim_{h\to 0}\frac{1}{2}\int \frac{1}{q}\left(\frac{q(\theta + h) - q(\theta)}{h}\right)^2 dz$

$= \frac{1}{2}\int \frac{1}{q}\left(\frac{\partial q}{\partial \theta}\right)^2 dz = \frac{1}{2}\int q\left(\frac{\partial \ln q}{\partial \theta}\right)^2 dz = \frac{1}{2}\,I,$

where I is the Fisher information.
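A numerical check of this limit under an assumed one-parameter family, q(z, θ) = N(z; θ, 1), whose Fisher information is I = 1; the integration grid and function names are illustrative only:

```python
import numpy as np

# Assumed family (not from the slides): q(z, theta) = N(z; theta, 1),
# whose Fisher information is I(theta) = 1 for every theta.
z = np.linspace(-10.0, 10.0, 200_001)
dz = z[1] - z[0]

def gauss(theta):
    return np.exp(-0.5 * (z - theta)**2) / np.sqrt(2.0 * np.pi)

def kl(theta, h):
    # D(q(theta) || q(theta + h)) by numerical integration over z
    q, p = gauss(theta), gauss(theta + h)
    return np.sum(q * np.log(q / p)) * dz

I = 1.0
for h in [0.1, 0.01, 0.001]:
    print(h, kl(0.0, h) / h**2, 0.5 * I)   # the ratio matches I/2
```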


Multilayer Neural Network (1/2)


Multilayer Neural Network (2/2)

• The conditional distribution of the output is $p(y\,|\,x; w) = c\,\exp\{-\tfrac{1}{2}[\,y - f(x, w)\,]^2\}$, where f(x, w) is the network output and c is a normalizing constant.
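A sketch under the unit-variance Gaussian-noise reading of this model, in which the log loss −log p(y|x, w) reduces to the squared error up to the constant −log c; the tiny one-hidden-layer network f and its weights are made-up placeholders, not the paper's architecture:

```python
import numpy as np

def forward(x, W, v):
    # Assumed one-hidden-layer perceptron: f(x, w) = v . tanh(W x)
    return v @ np.tanh(W @ x)

def log_loss(y, x, W, v):
    # -log p(y | x, w) = 0.5 * (y - f(x, w))^2 - log c  under the Gaussian-noise assumption;
    # the -log c term does not depend on w, so it is dropped here.
    return 0.5 * (y - forward(x, W, v))**2

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))    # hidden-layer weights (3 hidden units, 2 inputs)
v = rng.standard_normal(3)         # output weights
x, y = np.array([0.5, -1.0]), 0.2
print(log_loss(y, x, W, v))
```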


Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (1/4)

• $D_T = \{(x_1, y_1), \ldots, (x_T, y_T)\}$ is a set of T independent input-output examples generated by the teacher network having parameter w*.

• Minimizing the log loss over the training data $D_T$ amounts to obtaining the estimator $\hat{w}_T$ that minimizes the training error

$L_{\mathrm{train}}(w) = \frac{1}{T}\sum_{t=1}^{T} l(x_t, y_t; w).$


Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (2/4)

• The Cramér-Rao theorem states that the expected squared error of an unbiased estimator satisfies

$E\left[(\hat{w}_T - w^*)(\hat{w}_T - w^*)^T\right] \ge \frac{1}{T}\,I^{-1},$

where I is the Fisher information matrix.

• An estimator is said to be efficient, or Fisher efficient, when it attains this bound (asymptotically).
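A Monte Carlo sketch of the bound for an assumed toy problem (estimating a Gaussian mean from T samples, where the maximum-likelihood estimator is the sample mean); the parameters below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed problem: estimate w* from T samples z ~ N(w*, sigma^2); the MLE is the sample mean.
w_star, sigma, T, trials = 1.5, 2.0, 50, 20_000
I = 1.0 / sigma**2                      # Fisher information per sample

z = w_star + sigma * rng.standard_normal((trials, T))
w_hat = z.mean(axis=1)                  # unbiased (and efficient) estimator of w*

emp_err = np.mean((w_hat - w_star)**2)  # empirical E[(w_hat_T - w*)^2]
cr_bound = 1.0 / (T * I)                # Cramér-Rao bound (1/T) I^{-1}
print(emp_err, cr_bound)                # the two nearly coincide
```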


Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (3/4)

• Theorem 2. The natural gradient online estimator is Fisher efficient.

• Proof


Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (4/4)