IGAIA 4 Bohemia
Information Geometry: Historical Episodes and Future
with Recent Developments
Shun-ichi Amari, RIKEN Brain Science Institute
Prehistory: Riemannian Geometry
H. Hotelling 1929: Riemannian metric and Fisher information; location-scale model: constant curvature
P. C. Mahalanobis 1936: Euclidean distance (multivariate Gaussian)
C. R. Rao 1945: Cramér-Rao theorem; Riemannian metric
H. Jeffreys 1946: Bayesian theory and Jeffreys invariant prior
Dual Geometry, Invariance
N. Chentsov 1972: invariance, {g, T}, α-connection
B. Efron 1975 (A. P. Dawid): statistical curvature; higher-order asymptotics
O. Barndorff-Nielsen 1976: exponential family; Legendre transform
S. Amari 1982: duality; curvature and statistics (M. Kumon)
H. Nagaoka and S. Amari 1982: duality, Pythagorean theorem
Amari's personal history
1958: statistics seminar (master course at U Tokyo)
S. Kullback, "Information Theory and Statistics"; Riemannian metric (suggested by S. Moriguti)
Gaussian: geodesic, constant curvature (Poincaré half-plane)
beautiful structure; essential meaning? mathematical engineering
graph and topology of networks: homology
non-Riemannian geometry of the materials manifold: dislocations
information systems, learning and neural networks
$N(\mu, \sigma^2)$
Statistical curvature and higher-order inference
B. Efron, 1975: statistical curvature; Fisher's idea
A. P. Dawid, 1975: e- and m-connections (exponential and mixture connections)
S. Amari: α-geometry
(Rao, Kano K? Fisher's dream)
Amari and M. Kumon: higher-order power of statistical tests
$\mathrm{Error} = \frac{1}{n}\, G^{-1} + \frac{1}{n^2}\left( H_e^2 + 2 H_m^2 \right) + \frac{1}{n^3}\, K + \cdots$
Amari's paper: Ann. Statist. 1982
reviewers: S. Lauritzen and A. P. Dawid; Chentsov's work (handwritten manuscript)
H. Nagaoka and S. Amari 1982 (Technical Report): Ann. Probability, 7 reviewers; Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete: "geometry has nothing to do with statistics"
IEEE Trans. Inf. Theory (Shannon theory): now well-known
London Workshop 1984 (D. Cox); Cox visited Japan in 1983: patron of information geometry
Rao, Efron, Dawid, Barndorff-Nielsen, Lauritzen, Kass, Eguchi, and many others
Dodson, Critchley, Marriott, Komaki, Zhang, Ay, Pistone, Gibilisco, Nielsen, …
Information Geometry: a lucky naming
Application areas: statistics, time series and systems, machine learning, signal processing, optimization, brain theory, consciousness, physics, economics, mathematics
(Banach manifolds, affine differential geometry and beyond); quantum information, Tsallis entropy
International conferences: IGAIA series, GSIS series, …
Many monographs; a new journal (Jun Zhang); where to publish; mailing list and society
still a small community: united and cooperative, blessed by all
My recent works
1. Systems complexity and consciousness (IIT)
2. Geometry of score matching (Hyvärinen score)
3. Natural gradient descent and topology of deep learning
4. Canonical divergence
5. Multi-terminal statistical inference
6. Information geometry and Wasserstein distance
Information Integration and Complexity of Systems
A stochastic approach
[Figure: nodes $x_1, x_2 \to y_1, y_2$]
$p(\boldsymbol x, \boldsymbol y) = p(\boldsymbol x)\, p(\boldsymbol y \mid \boldsymbol x)$
$\boldsymbol x$: state of the brain; $\boldsymbol y$: next state of the brain
Integrated Information Theory (G. Tononi): Φ
necessary condition; sufficient?
full model vs. disconnected model
measures of interaction: N. Ay; information integration: Tononi; Barrett and Seth; many others
[Figure: full model with all branches $x_i \to y_j$ vs. disconnected model]
Full model: $S_F = \{\, p(\boldsymbol x, \boldsymbol y) \,\}$
Disconnected model: $S_{dis} = \{\, q(\boldsymbol x, \boldsymbol y) \,\}, \qquad q(\boldsymbol y \mid \boldsymbol x) = \prod_i q(y_i \mid x_i)$
Measure of information integration, or system complexity
Information Geometry: N. Ay
Definition of Φ: three postulates; disconnected model given by Markov conditions
$\Phi = \min_q D[\,p(\boldsymbol x, \boldsymbol y) : q(\boldsymbol x, \boldsymbol y)\,], \qquad q \in S_{dis}$
$D_{KL}[p : q] = \sum p(\boldsymbol x, \boldsymbol y)\, \log \frac{p(\boldsymbol x, \boldsymbol y)}{q(\boldsymbol x, \boldsymbol y)}$
Markov Condition
branch (1→2) deleted: Markov condition
$p(x_1, y_2 \mid x_2) = p(x_1 \mid x_2)\, p(y_2 \mid x_2), \qquad x_1 \perp y_2 \mid x_2$
$S_{dis}$: all branches $x_i \to y_j$ ($i \neq j$) deleted
$X_1 \perp Y_2 \mid X_2, \qquad X_2 \perp Y_1 \mid X_1$
Why KL-divergence?
1) $D[p : q] \ge 0$, $\ = 0$ when and only when $p = q$
2) decomposability: $D[p : q] = \sum_i d[\,p(x_i), q(x_i)\,]$
3) invariance under transformations of $x$
4) induces a dually flat structure
Geometric degree of information integration:
$\Phi_{geo} = \min_q D_{KL}[\,p(\boldsymbol x, \boldsymbol y) : q(\boldsymbol x, \boldsymbol y)\,], \qquad q \in S_{dis}$
Gaussian case
full model: $\boldsymbol y = A \boldsymbol x + \boldsymbol e, \qquad \Sigma = E[\boldsymbol e\, \boldsymbol e^T]$
disconnected model: $\boldsymbol y = A' \boldsymbol x + \boldsymbol e', \qquad \Sigma' = E[\boldsymbol e'\, \boldsymbol e'^T], \qquad A'$: diagonal
$\Phi_{geo} = \frac{1}{2} \log \frac{|\Sigma'|}{|\Sigma|}$
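The Gaussian formula can be checked numerically. Below is a minimal sketch (not from the talk) that obtains $\Phi_{geo}$ by direct minimization, assuming the disconnected conditional factorizes as $q(\boldsymbol y \mid \boldsymbol x) = \prod_i q(y_i \mid x_i)$ with linear-Gaussian form; the matrices and the optimizer choice are illustrative.

```python
# Minimal numerical sketch of Phi_geo for the Gaussian case: full model
# y = A x + e, x ~ N(0, Sx), e ~ N(0, S); disconnected model y = A' x + e'
# with diagonal A' and diagonal noise, i.e. q(y|x) = prod_i q(y_i|x_i).
# All matrices here are illustrative, not from the talk.
import numpy as np
from scipy.optimize import minimize

n = 2
A = np.array([[0.9, 0.4],
              [0.3, 0.8]])            # full connectivity, with cross branches
Sx = np.eye(n)                        # covariance of x
S = 0.1 * np.eye(n)                   # noise covariance of the full model

def kl_to_disconnected(params):
    """KL[p(x,y) : q(x,y)] when q(x) = p(x); closed form for Gaussians."""
    Ad = np.diag(params[:n])          # diagonal A' of the disconnected model
    Sq = np.diag(np.exp(params[n:]))  # diagonal noise covariance, kept positive
    # covariance of the residual y - A'x under the full model p
    R = (A - Ad) @ Sx @ (A - Ad).T + S
    return 0.5 * (np.log(np.linalg.det(Sq) / np.linalg.det(S))
                  + np.trace(np.linalg.inv(Sq) @ R) - n)

x0 = np.concatenate([np.diag(A), np.log(np.diag(S))])
res = minimize(kl_to_disconnected, x0, method="Nelder-Mead")
print("Phi_geo ~", res.fun)
```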
Many other definitions of Φ
Full model and disconnected models
Full model $S_F$: graphical model
$p(\boldsymbol x, \boldsymbol y) = \exp\left\{ \sum \theta^X_i x_i + \sum \theta^Y_i y_i + \sum \theta^{XY}_{ij} x_i y_j + \text{higher-order terms} \right\}$
exponential family: $\theta$-coordinates and $\eta$-coordinates, $\quad \eta^X_i = E[x_i], \ \ \eta^{XY}_{ij} = E[x_i y_j], \dots$
There are many disconnected models!!
Split Model $S_H$: Ay, Barrett & Seth
$S_H = \{\, q : q(\boldsymbol x, \boldsymbol y) = q(\boldsymbol x) \prod_i q(y_i \mid x_i) \,\}, \qquad \theta^{XY}_{12} = \theta^{XY}_{21} = \theta^{Y}_{12} = 0$
$\Phi_H = \min_{q \in S_H} D_{KL}[p : q]$
minimizer: $\hat q(\boldsymbol x) = p(\boldsymbol x), \qquad \hat q(y_i \mid x_i) = p(y_i \mid x_i)$
$\Phi_H = \sum_i H(Y_i \mid X_i) - H(\boldsymbol Y \mid \boldsymbol X)$
[Figure: split model keeping only the branches $X_1 \to Y_1$ and $X_2 \to Y_2$]
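As a sanity check of the closed form $\Phi_H = \sum_i H(Y_i \mid X_i) - H(\boldsymbol Y \mid \boldsymbol X)$ obtained from the minimizer above, here is a small sketch evaluating it on an explicit joint table; the random joint distribution is purely illustrative.

```python
# Small sketch: Phi_H = sum_i H(y_i|x_i) - H(y|x), evaluated on an explicit
# joint table p[x1, x2, y1, y2] with binary units. The joint is random,
# purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()

def entropy(t):
    """Shannon entropy (bits) of a probability table."""
    t = t[t > 0]
    return -np.sum(t * np.log2(t))

# conditional entropies via the chain rule H(A|B) = H(A,B) - H(B)
H_y1_x1 = entropy(p.sum(axis=(1, 3))) - entropy(p.sum(axis=(1, 2, 3)))
H_y2_x2 = entropy(p.sum(axis=(0, 2))) - entropy(p.sum(axis=(0, 2, 3)))
H_y_x = entropy(p) - entropy(p.sum(axis=(2, 3)))

print("Phi_H =", H_y1_x1 + H_y2_x2 - H_y_x, "bits")  # always >= 0
```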
Mixed Coordinates:
$\boldsymbol\zeta = (\,\eta\text{-part};\ \theta^{XY}_{12}, \theta^{XY}_{21}, \theta^{Y}_{12}\,)$
$S_H:\ \theta^{XY}_{12} = \theta^{XY}_{21} = \theta^{Y}_{12} = 0$
projection: $\hat q$ keeps the $\eta$-part of $p$, $\quad \hat q \in S_H$
Markovian Condition
$Y_1 \perp X_2 \mid X_1; \qquad Y_2 \perp X_1 \mid X_2$
$p_{ind} = p_X\, p_Y \Rightarrow \Phi = 0$; $\quad \Phi > 0$ when $I(\boldsymbol x ; \boldsymbol y) > 0$
problem: the case $I(X ; Y) = 0$
Split Model $S_{Gr}$: graphical model
cross branches deleted: $\theta^{XY}_{12} = \theta^{XY}_{21} = 0$
projection: $\hat q(\boldsymbol x) = p(\boldsymbol x), \qquad \hat q(y_i \mid x_i) = p(y_i \mid x_i)$
$\Phi_{Gr} = 0$ when $I(X ; Y) = 0$
Problem: Gaussian channel
$p:\ \boldsymbol y = A \boldsymbol x + \boldsymbol e$; $\quad$ projection $\hat q:\ \boldsymbol y = \hat A \boldsymbol x + \hat{\boldsymbol e}, \ \hat q \in S_{Gr}$
$A$ not diagonal $\Rightarrow \theta^{XY}_{12}, \theta^{XY}_{21} \neq 0$
$p(\boldsymbol x, \boldsymbol y) = \frac{1}{Z} \exp\left\{ -\frac{1}{2}\, \boldsymbol x^T \Sigma_X^{-1} \boldsymbol x - \frac{1}{2}\, (\boldsymbol y - A \boldsymbol x)^T \Sigma^{-1} (\boldsymbol y - A \boldsymbol x) \right\}$
Mismatched Decoding Model $S_M$
best mismatched decoding from $\boldsymbol y$ to $\hat{\boldsymbol x}$
$S_M = \{\, q(\boldsymbol x, \boldsymbol y) = p(\boldsymbol x)\, p(\boldsymbol y) \prod_i p(y_i \mid x_i) \,\}$
$\Phi^* = \min_{q \in S_M} D_{KL}[p : q]$
$S_H,\ S_{Gr}$: dually flat; $\quad S_M,\ S_{geo}$: not flat
$S_H \subset S_{Gr} \subset S_M$
$\Phi_{geo} \le \Phi_{Gr} \le \Phi_H$
Transfer Entropy; Granger causality
$TE[x_i \to y_j] = \min_q D_{KL}[p : q], \qquad q$: branch $(i \to j)$ disconnected
$TE = H[Y_j \mid X_j] - H[Y_j \mid X_i, X_j]$
Non-additive: $TE[x_i, x_k \to y_j, y_m] \neq TE[x_i \to y_j] + TE[x_k \to y_m]$
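Following the same table-based approach as above, a minimal sketch of the transfer entropy $TE = H(Y_j \mid X_j) - H(Y_j \mid X_i, X_j)$; the joint table is again random and purely illustrative.

```python
# Minimal sketch: transfer entropy TE(x1 -> y2) = H(y2|x2) - H(y2|x1, x2)
# from an explicit joint table p[x1, x2, y2]; the joint is illustrative.
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()

def entropy(t):
    t = t[t > 0]
    return -np.sum(t * np.log2(t))

# H(A|B) = H(A,B) - H(B)
H_y2_given_x2 = entropy(p.sum(axis=0)) - entropy(p.sum(axis=(0, 2)))
H_y2_given_x1x2 = entropy(p) - entropy(p.sum(axis=2))
print("TE(x1 -> y2) =", H_y2_given_x2 - H_y2_given_x1x2, "bits")
```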
Hierarchy: transfer entropy
partitions $X = (X_i, X_j), \qquad Y = (Y_i, Y_j)$
cutting branches: split models
partition of $X \times Y$; subadditivity
Information Geometry of the Hyvärinen Game Score
Following Grünwald, Dawid, Parry, Lauritzen, Hyvärinen
score $S(x, q)$; log score: $l(x, q) = -\log q(x)$
loss: $L(p, q) = E_p[\,S(x, q)\,]$
$S$-entropy: $H_S(p) = E_p[\,S(x, p)\,]$
$S$-divergence: $D_S[p : q] = H_S(p, q) - H_S(p, p)$, where $H_S(p, q) = E_p[\,S(x, q)\,]$
Hyvärinen score
$S(x, q) = \frac{1}{2}\, l'(x)^2 + l''(x), \qquad l(x) = \log q(x)$
$D_S[p : q] = \frac{1}{2}\, E_p\!\left[ \left( \frac{d}{dx} \log p(x) - \frac{d}{dx} \log q(x) \right)^{\!2} \right]$
$D_S[p : cq] = D_S[p : q]$ : normalization unnecessary
parametric case $M = \{\, q(x, \boldsymbol\theta) \,\}$: $\quad \hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta} D_S[p : q(x, \boldsymbol\theta)]$
$\boldsymbol s(x, \boldsymbol\theta) = \partial_{\boldsymbol\theta}\, S(x, q(x, \boldsymbol\theta))$ : estimating function
$\sum_{i=1}^N \boldsymbol s(x_i, \boldsymbol\theta) = 0$ : estimating equation
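To make the estimating equation concrete, here is a minimal sketch of the Hyvärinen estimator for a 1-D Gaussian model, where $S(x, q) = (x - \mu)^2 / (2v^2) - 1/v$ with $v = \sigma^2$; the data and optimizer choice are illustrative assumptions. For Gaussian $q$ the minimizer coincides with the MLE, consistent with the efficiency statement below.

```python
# Minimal sketch of score matching with the Hyvarinen score for a 1-D
# Gaussian model q(x; mu, v), v = sigma^2. Here S(x, q) = (1/2) l'(x)^2
# + l''(x) with l = log q, i.e. (x - mu)^2 / (2 v^2) - 1/v. Data and the
# optimizer choice are illustrative, not from the talk.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=5000)

def empirical_score(params):
    mu, log_v = params
    v = np.exp(log_v)                 # variance, kept positive
    l1 = -(data - mu) / v             # l'(x)
    l2 = -1.0 / v                     # l''(x)
    return np.mean(0.5 * l1**2 + l2)  # empirical Hyvarinen score

res = minimize(empirical_score, x0=[0.0, 0.0])
mu_hat, v_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, v_hat)                  # ~ sample mean and variance (the MLE)
```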
Information geometry of $D_S[p : q]$
Asymptotic analysis of the estimator:
$E[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^T] = \frac{1}{N}\, K^{-1} V K^{-T} \ \ge\ \frac{1}{N}\, G^{-1}$
$K = E[\,\partial_{\boldsymbol\theta}\, \boldsymbol s(x, \boldsymbol\theta)\,], \qquad V = E[\,\boldsymbol s(x, \boldsymbol\theta)\, \boldsymbol s(x, \boldsymbol\theta)^T\,], \qquad G$: Fisher information
$\boldsymbol s(x, \boldsymbol\theta) = c\, \partial_{\boldsymbol\theta} \log q(x, \boldsymbol\theta) + \boldsymbol a(x, \boldsymbol\theta)$
$E[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^T] = \frac{1}{N}\, G^{-1} A\, G^{-1}, \qquad A = E[\,\boldsymbol a(x, \boldsymbol\theta)\, \boldsymbol a(x, \boldsymbol\theta)^T\,]$
efficient when $\boldsymbol s \propto \partial_{\boldsymbol\theta} \log q$
Hyvärinen estimator: Fisher efficient $\iff q$ is multivariate Gaussian
Discrete case: $\boldsymbol x$ on graph nodes $x_1, \dots, x_N$
difference operator in place of the derivative: $\tilde\partial f(\boldsymbol x) = f(\boldsymbol x') - f(\boldsymbol x)$, $\quad \boldsymbol x'$: neighboring node
score $S(\boldsymbol x, q)$ and divergence $D[q_1 : q_2]$ defined as in the continuous case, with the inner product $\langle f, h \rangle = \int f(x)\, h(x)\, dx$ replaced by $\sum_{\boldsymbol x} f(\boldsymbol x)\, h(\boldsymbol x) \ge 0$
Deep Learning
Self-organization + supervised learning
RBM: Restricted Boltzmann Machine; auto-encoder, recurrent net
dropout, contrastive divergence, convolution
Mathematical Neurons
$y = \varphi\left( \sum_i w_i x_i - h \right) = \varphi(\boldsymbol w \cdot \boldsymbol x - h)$
activation function $\varphi(u)$
Multilayer Perceptrons
$y = \sum_i v_i\, \varphi(\boldsymbol w_i \cdot \boldsymbol x) = f(\boldsymbol x, \boldsymbol\theta)$
$\boldsymbol x = (x_1, x_2, \dots, x_n), \qquad \boldsymbol\theta = (\boldsymbol w_1, \dots, \boldsymbol w_m;\ v_1, \dots, v_m)$
Multilayer Perceptron
$y = f(\boldsymbol x; \boldsymbol\theta) = \sum_i v_i\, \varphi(\boldsymbol w_i \cdot \boldsymbol x), \qquad \boldsymbol\theta = (\boldsymbol w_1, \dots, \boldsymbol w_m;\ v_1, \dots, v_m)$
neuromanifold: $\{\, f(\boldsymbol x; \boldsymbol\theta) \,\}$, a subspace of the space of functions $S$
Backpropagation: stochastic gradient learning
examples $(\boldsymbol x_1, y_1), (\boldsymbol x_2, y_2), \dots$ : training set
$l(y, \boldsymbol x; \boldsymbol\theta) = \frac{1}{2}\left( y - f(\boldsymbol x, \boldsymbol\theta) \right)^2 = -\log p(y \mid \boldsymbol x; \boldsymbol\theta)$
$\boldsymbol\theta_{t+1} = \boldsymbol\theta_t - \eta_t\, \nabla l(y_t, \boldsymbol x_t; \boldsymbol\theta_t)$
singularities
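A minimal numpy sketch of this vanilla stochastic gradient (backpropagation) rule for the perceptron $f(\boldsymbol x, \boldsymbol\theta) = \sum_i v_i\, \varphi(\boldsymbol w_i \cdot \boldsymbol x)$; the tanh activation, toy teacher signal, and sizes are illustrative assumptions, not from the talk.

```python
# Minimal sketch: one-hidden-layer perceptron f(x) = sum_i v_i * phi(w_i . x)
# trained by vanilla stochastic gradient descent on l = (1/2)(y - f(x))^2.
# Activation, teacher signal and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                                   # input dim, hidden units
W = 0.5 * rng.normal(size=(m, n))             # hidden weight vectors w_i
v = 0.5 * rng.normal(size=m)                  # output weights v_i
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2

eta = 0.05                                    # learning rate
for _ in range(5000):                         # stochastic gradient learning
    x = rng.normal(size=n)
    y = np.sin(x.sum())                       # teacher signal (toy target)
    u = W @ x                                 # hidden pre-activations
    err = v @ phi(u) - y                      # dl/df = f(x) - y
    grad_v = err * phi(u)                     # dl/dv_i
    grad_W = err * (v * dphi(u))[:, None] * x[None, :]   # backprop to w_i
    v -= eta * grad_v
    W -= eta * grad_W
print("last-sample squared error:", 0.5 * err**2)
```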
Geometry of singular models
$y = v\, \varphi(\boldsymbol w \cdot \boldsymbol x) + n$; singular where $v = 0$ or $|\boldsymbol w| = 0$
model with 2 hidden neurons:
$y = f(\boldsymbol x) + n, \qquad f(\boldsymbol x) = w_1\, \varphi(\boldsymbol J_1 \cdot \boldsymbol x) + w_2\, \varphi(\boldsymbol J_2 \cdot \boldsymbol x)$
$\varphi(u) = \frac{1}{\sqrt{2\pi}} \int_0^u e^{-t^2/2}\, dt$
loss function: $l(y, \boldsymbol x; \boldsymbol\theta) = \frac{1}{2}\left( y - f(\boldsymbol x, \boldsymbol\theta) \right)^2, \qquad y$: teacher signal
stochastic descent learning; backprop: vanilla gradient
$\boldsymbol\theta_{t+1} = \boldsymbol\theta_t - \eta_t\, \nabla l(y_t, \boldsymbol x_t; \boldsymbol\theta_t)$
Natural Gradient Stochastic Descent
$\boldsymbol\theta_{t+1} = \boldsymbol\theta_t - \eta_t\, G^{-1}\, \nabla l(y_t, \boldsymbol x_t; \boldsymbol\theta_t)$
$G = E[\,\nabla l\, \nabla l^T\,]$ : Fisher information matrix
invariant; steepest descent
Natural gradient is superior
steepest descent; invariant (Yann Ollivier)
Fisher-efficient
natural gradient is non-vanishing even in multiple layers
good at singular regions (avoids plateaus: Milnor attractor)
adaptive natural gradient
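A minimal sketch of the natural gradient update $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\, G^{-1} \nabla l$ for the simplest case of a 1-D Gaussian model, where the Fisher matrix $G = \mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$ in $(\mu, \sigma)$ coordinates is known analytically; the data and step size are illustrative assumptions.

```python
# Minimal sketch of natural gradient descent theta <- theta - eta G^{-1} grad l
# for a 1-D Gaussian model p(x; mu, sigma), whose Fisher information matrix
# G = diag(1/sigma^2, 2/sigma^2) in (mu, sigma) coordinates is analytic.
# Data and step size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 2.0, size=1000)

def grad_nll(mu, sigma):
    """Gradient of the average negative log-likelihood of N(mu, sigma^2)."""
    g_mu = -(data - mu).mean() / sigma**2
    g_sigma = 1.0 / sigma - ((data - mu) ** 2).mean() / sigma**3
    return np.array([g_mu, g_sigma])

theta = np.array([0.0, 1.0])                  # (mu, sigma) initial guess
for _ in range(200):
    g = grad_nll(*theta)
    G_inv = np.diag([theta[1]**2, theta[1]**2 / 2.0])   # inverse Fisher
    theta = theta - 0.1 * G_inv @ g           # natural gradient step
print(theta)                                  # ~ (3.0, 2.0)
```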
Singular Region in Parameter Space
$f(\boldsymbol x) = w_1\, \varphi(\boldsymbol J_1 \cdot \boldsymbol x) + w_2\, \varphi(\boldsymbol J_2 \cdot \boldsymbol x)$
$R(w, \boldsymbol J)$: $\ \boldsymbol J_1 = \boldsymbol J_2 = \boldsymbol J, \quad w_1 + w_2 = w$
also $(0, w, \boldsymbol J_1, \boldsymbol J)$ and $(w, 0, \boldsymbol J, \boldsymbol J_2)$: one output weight vanishes, the corresponding $\boldsymbol J_i$ arbitrary
Coordinate transformation
$w = w_1 + w_2, \qquad z = \frac{w_2 - w_1}{w_1 + w_2}$
$\boldsymbol v = \frac{w_1 \boldsymbol J_1 + w_2 \boldsymbol J_2}{w_1 + w_2}, \qquad \boldsymbol u = \boldsymbol J_2 - \boldsymbol J_1$
$(w_1, w_2, \boldsymbol J_1, \boldsymbol J_2) \ \leftrightarrow\ (w, z, \boldsymbol v, \boldsymbol u)$
Singular region $R$: $\ \boldsymbol u = 0, \ |z| \le 1$
Milnor attractor
[Fig. 2: trajectories]
saddle and plateau
topology of the singular region $R$
blow-down coordinates:
$c_1 = (1 - z^2)\, u^2, \qquad c_2 = (1 - z^2)\, z\, u^3$
$\boldsymbol e = \boldsymbol u / |\boldsymbol u|, \qquad \boldsymbol e \in S^{n-1}$
singular region $R$: $\ \boldsymbol u = 0, \ |z| \le 1$
Sphere $S^n$ and projective space $P^n$
natural gradient learning near the singularity:
true model $\in R$: $\ \frac{d\boldsymbol\theta}{dt} \to R$ (Milnor attractor)
true model $\notin R$: $\ \frac{d\boldsymbol\theta}{dt} = O(1)$
Canonical Divergence in a Manifold of Dual Affine Connections
Nihat Ay and S. Amari
Divergence and metric
$D[p : q] \ge 0, \qquad = 0 \iff p = q$
$D[\boldsymbol\xi : \boldsymbol\xi + d\boldsymbol\xi] = \frac{1}{2} \sum g_{ij}\, d\xi^i\, d\xi^j + O(|d\boldsymbol\xi|^3)$
$G = (g_{ij})$: Riemannian metric, positive definite
Divergence and dual affine connections
$g_{ij} = -\,\partial_i\, \partial'_j\, D\big|_{q=p}$
$\Gamma_{ijk} = -\,\partial_i \partial_j\, \partial'_k\, D\big|_{q=p}, \qquad \Gamma^*_{ijk} = -\,\partial'_i \partial'_j\, \partial_k\, D\big|_{q=p}$
($\partial$: derivative in the first argument $p$; $\partial'$: in the second argument $q$)
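As a standard worked instance (a well-known fact, not specific to this talk): taking $D = D_{KL}$ in the relations above yields
$g_{ij} = -\,\partial_i \partial'_j\, D_{KL}\big|_{q=p} = E_p[\,\partial_i \log p\ \partial_j \log p\,]$ : the Fisher metric,
and the induced pair $(\nabla, \nabla^*)$ are the mixture (m-) and exponential (e-) connections.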
Dual geometry
$(M, g, \nabla, \nabla^*)$: $\quad X\langle Y, Z\rangle = \langle \nabla_X Y,\, Z\rangle + \langle Y,\, \nabla^*_X Z\rangle$
$(M, g, T)$: $\quad \Gamma_{ijk} = \Gamma^0_{ijk} - \frac{1}{2} T_{ijk}, \qquad \Gamma^*_{ijk} = \Gamma^0_{ijk} + \frac{1}{2} T_{ijk}$
$\Gamma^0$: Levi-Civita connection
Dual geometry and the canonical divergence
$M$ dually flat: coordinates $(\boldsymbol\theta, \boldsymbol\eta)$, potentials $(\psi, \varphi)$
canonical divergence $D$: Bregman divergence
$D[p : q] = \psi(\boldsymbol\theta_p) + \varphi(\boldsymbol\eta_q) - \boldsymbol\theta_p \cdot \boldsymbol\eta_q$
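As a worked instance (standard, not specific to this talk): for an exponential family $p(x; \boldsymbol\theta) = \exp\{\boldsymbol\theta \cdot \boldsymbol x - \psi(\boldsymbol\theta)\}$ with $\boldsymbol\eta = \nabla\psi(\boldsymbol\theta)$ and $\varphi$ the Legendre dual of $\psi$, the Bregman form evaluates to
$D[p : q] = \psi(\boldsymbol\theta_p) + \varphi(\boldsymbol\eta_q) - \boldsymbol\theta_p \cdot \boldsymbol\eta_q = D_{KL}[q : p]$,
so the canonical divergence of the dually flat structure reproduces the KL divergence (with the argument order of this convention).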
Exponential map: geodesic $\sigma(t)$ with $\sigma(0) = p$, $\sigma(1) = q$
$X_p(q) = \dot\sigma(0) \in T_p M$ : inverse exponential map (initial velocity of the geodesic from $p$ to $q$); e.g. $X_p(q) = \log q - \log p$ for the exponential connection
Exponential map divergence: $D[p : q] = \frac{1}{2}\, \| X_q(p) \|^2$
$\nabla$-divergence: the same construction along the $\nabla$-geodesic
Theorem 1. The exponential map divergence induces the $\tfrac{1}{3}$-geometry.
Standard divergence: $D_{stan}[p : q] = \frac{1}{2}\, \| X^{(1/3)}_q(p) \|^2$
Theorem 2. The exponential map divergence recovers the original geometry:
$D[p : q] = \int_0^1 \big\langle X_{\xi_t}(p),\ \dot\xi(t) \big\rangle\, dt, \qquad \xi_t = \xi_t(q, p)$ : geodesic
$\ = \int_0^1 t\, \| \dot\xi(t) \|^2\, dt, \qquad$ weighted form: $\int_0^1 w(t)\, \| \dot\xi(t) \|^2\, dt$
Divergence and projection
projection theorem: $\hat q = \arg\min_{q \in S} D[p : q]$
$\operatorname{grad}_q D[p : q] = c\, X_q(p)$
IEEE ISIT-2011, Sankt Petersburg
Data Compression in Multiterminal Statistical Inference
Shun-ichi Amari, RIKEN Brain Science Institute
A long-standing problem: T. Berger; Csiszár, Ahlswede, Burnashev, Han, Amari
correlated sources $X$, $Y$: data compression and statistical inference
$X = x_1 x_2 \cdots x_n \ \to\ k_X$ bits
$Y = y_1 y_2 \cdots y_n \ \to\ k_Y$ bits
$p(x, y; q)$, iid
binary: $x, y \in \{0, 1\}$; $\ \operatorname{Prob}\{x = 1\} = \operatorname{Prob}\{y = 1\} = \frac{1}{2}$; $\ \operatorname{Prob}\{x \neq y\} = q$
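One orientation point (a standard reduction, not from the slides): under full observation the per-pair statistic $z_i = x_i \oplus y_i$ is Bernoulli($q$), so $n$ pairs carry Fisher information $n / (q(1-q))$ about $q$; the multiterminal question is how much of this survives once $X$ and $Y$ are compressed to $k_X$ and $k_Y$ bits.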
Encoding: data compression
$\boldsymbol x \in X^n \ \mapsto\ c(\boldsymbol x)$: $\ 2^n$ sequences $\to 2^{k_X}$ codewords ($n$ bits $\to k_X$ bits)
One-bit helper case: $k_X = 1$, $k_Y = n$
$c = \operatorname{sgn}(\boldsymbol a \cdot \boldsymbol x)$
$X = x_1 x_2 \cdots x_n \ \to\ c$ : 1 bit; $\qquad Y = y_1 y_2 \cdots y_n$ : $n$ bits
Is single-bit encoding optimal? It is optimal when $q = 1/2$ ($\boldsymbol x$, $\boldsymbol y$ independent), but not for general $q$.
Fisher information: $k_X = 1$, $k_Y = n$
Kingo Kobayashi: parity encoding, $c = x_1 \oplus x_2 \oplus \cdots \oplus x_s$
in progress!
Information Geometry and the Transportation Problem (Wasserstein distance)
entropic relaxation: $\min_P\ \langle C, P \rangle - \lambda H(P)$, and its dual
New paper: S. Pal and T.-K. L. Wong, "Exponentially concave functions and a new information geometry"
portfolio theory, the transportation problem and information geometry (dually projectively flat)
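The entropic relaxation above is commonly solved by Sinkhorn fixed-point iterations on the dual scalings; a minimal sketch, in which the cost matrix, marginals, and regularization strength are illustrative assumptions, not from the talk:

```python
# Minimal Sinkhorn sketch for the entropic relaxation min <C, P> - lam * H(P)
# over couplings P with fixed marginals p and q. All inputs are illustrative.
import numpy as np

n = 5
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :]) ** 2      # squared-distance cost matrix
p = np.full(n, 1.0 / n)                 # source marginal
q = np.full(n, 1.0 / n)                 # target marginal
lam = 0.05                              # entropic regularization strength

K = np.exp(-C / lam)                    # Gibbs kernel
u = np.ones(n)
for _ in range(500):                    # Sinkhorn fixed-point iterations
    v = q / (K.T @ u)
    u = p / (K @ v)
P = u[:, None] * K * v[None, :]         # (approximately) optimal coupling
print("transport cost:", np.sum(P * C))
```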