IGAIA 4 Bohemia
Information Geometry: Historical Episodes and Future
with Recent Developments
Shun-ichi Amari, RIKEN Brain Science Institute
Prehistory: Riemannian Geometry
H. Hotelling 1929: Riemannian metric and Fisher information; location-scale model: constant curvature
P. C. Mahalanobis 1936: Euclidean distance (multivariate Gaussian)
C. R. Rao 1945: Cramér-Rao theorem; Riemannian metric
H. Jeffreys 1946: Bayesian theory and Jeffreys invariant prior
Dual Geometry, Invariance
N. Chentsov 1972: invariance, {g, T}, α-connection
B. Efron 1975 (A. P. Dawid): statistical curvature; higher-order asymptotics
O. Barndorff-Nielsen 1976: exponential family; Legendre transform
S. Amari 1982: duality; curvature and statistics (M. Kumon)
H. Nagaoka and S. Amari 1982: duality, Pythagorean theorem
Amari's personal history
1958: statistics seminar (master course at U Tokyo)
S. Kullback, "Information Theory and Statistics"; Riemannian metric (suggested by S. Moriguti)
Gaussian: geodesic, constant curvature (Poincaré half-plane)
beautiful structure; essential meaning? mathematical engineering
graph and topology of networks: homology
non-Riemannian geometry of the materials manifold: dislocations
information systems, learning and neural networks
$N(\mu, \sigma^2)$
Statistical curvature and higher-order inference
B. Efron, 1975: statistical curvature; Fisher's idea
A. P. Dawid, 1975: e- and m-connections (exponential and mixture connections)
S. Amari: α-geometry
(Rao, Kano K? Fisher's dream)
Amari and M. Kumon: higher-order power of statistical tests
$\mathrm{Error} = \frac{1}{n}\, G^{-1} + \frac{1}{n^2}\left( H_e^2 + 2 H_m^2 \right) + \frac{1}{n^3}\, K + \cdots$
Amari's paper: Ann. Statist. 1982
reviewers: S. Lauritzen and A. P. Dawid; Chentsov's work (handwritten manuscript)
H. Nagaoka and S. Amari 1982 (Technical Report): Ann. Probability, 7 reviewers; Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete: "geometry has nothing to do with statistics"
IEEE Trans. Inf. Theory (Shannon theory): now well-known
London Workshop 1984 (D. Cox); Cox visited Japan in 1983: patron of information geometry
Rao, Efron, Dawid, Barndorff-Nielsen, Lauritzen, Kass, Eguchi, and many others
Dodson, Critchley, Marriott, Komaki, Zhang, Ay, Pistone, Gibilisco, Nielsen, …
Information Geometry: a lucky naming
Application areas: statistics, time series and systems, machine learning, signal processing, optimization, brain theory, consciousness, physics, economics, mathematics
(Banach manifolds, affine differential geometry and beyond); quantum information, Tsallis entropy
International conferences: IGAIA series, GSIS series, …
Many monographs; a new journal (Jun Zhang); where to publish; mailing list and society
still a small community: united and cooperative, blessed by all
My recent works
1. Systems complexity and consciousness (IIT)
2. Geometry of score matching (Hyvärinen score)
3. Natural gradient descent and topology of deep learning
4. Canonical divergence
5. Multi-terminal statistical inference
6. Information geometry and Wasserstein distance
Information Integration and Complexity of Systems
A stochastic approach
[Figure: nodes $x_1, x_2 \to y_1, y_2$]
$p(\boldsymbol x, \boldsymbol y) = p(\boldsymbol x)\, p(\boldsymbol y \mid \boldsymbol x)$
$\boldsymbol x$: state of the brain; $\boldsymbol y$: next state of the brain
Integrated Information Theory (G. Tononi): Φ
necessary condition; sufficient?
full model vs. disconnected model
measures of interaction: N. Ay; information integration: Tononi; Barrett and Seth; many others
[Figure: full model with all branches $x_i \to y_j$ vs. disconnected model]
Full model: $S_F = \{\, p(\boldsymbol x, \boldsymbol y) \,\}$
Disconnected model: $S_{dis} = \{\, q(\boldsymbol x, \boldsymbol y) \,\}, \qquad q(\boldsymbol y \mid \boldsymbol x) = \prod_i q(y_i \mid x_i)$
Measure of information integration, or system complexity
Information Geometry: N. Ay
Definition of Φ: three postulates; disconnected model given by Markov conditions
$\Phi = \min_q D[\,p(\boldsymbol x, \boldsymbol y) : q(\boldsymbol x, \boldsymbol y)\,], \qquad q \in S_{dis}$
$D_{KL}[p : q] = \sum p(\boldsymbol x, \boldsymbol y)\, \log \frac{p(\boldsymbol x, \boldsymbol y)}{q(\boldsymbol x, \boldsymbol y)}$
Markov Condition
branch (1→2) deleted: Markov condition
$p(x_1, y_2 \mid x_2) = p(x_1 \mid x_2)\, p(y_2 \mid x_2), \qquad x_1 \perp y_2 \mid x_2$
$S_{dis}$: all branches $x_i \to y_j$ ($i \neq j$) deleted
$X_1 \perp Y_2 \mid X_2, \qquad X_2 \perp Y_1 \mid X_1$
Why KL-divergence?
1) $D[p : q] \ge 0$, $\ = 0$ when and only when $p = q$
2) decomposability: $D[p : q] = \sum_i d[\,p(x_i), q(x_i)\,]$
3) invariance under transformations of $x$
4) induces a dually flat structure
Geometric degree of information integration:
$\Phi_{geo} = \min_q D_{KL}[\,p(\boldsymbol x, \boldsymbol y) : q(\boldsymbol x, \boldsymbol y)\,], \qquad q \in S_{dis}$
Gaussian case
full model: $\boldsymbol y = A \boldsymbol x + \boldsymbol e, \qquad \Sigma = E[\boldsymbol e\, \boldsymbol e^T]$
disconnected model: $\boldsymbol y = A' \boldsymbol x + \boldsymbol e', \qquad \Sigma' = E[\boldsymbol e'\, \boldsymbol e'^T], \qquad A'$: diagonal
$\Phi_{geo} = \frac{1}{2} \log \frac{|\Sigma'|}{|\Sigma|}$
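The Gaussian formula can be checked numerically. Below is a minimal sketch (not from the talk) that obtains $\Phi_{geo}$ by direct minimization, assuming the disconnected conditional factorizes as $q(\boldsymbol y \mid \boldsymbol x) = \prod_i q(y_i \mid x_i)$ with linear-Gaussian form; the matrices and the optimizer choice are illustrative.

```python
# Minimal numerical sketch of Phi_geo for the Gaussian case: full model
# y = A x + e, x ~ N(0, Sx), e ~ N(0, S); disconnected model y = A' x + e'
# with diagonal A' and diagonal noise, i.e. q(y|x) = prod_i q(y_i|x_i).
# All matrices here are illustrative, not from the talk.
import numpy as np
from scipy.optimize import minimize

n = 2
A = np.array([[0.9, 0.4],
              [0.3, 0.8]])            # full connectivity, with cross branches
Sx = np.eye(n)                        # covariance of x
S = 0.1 * np.eye(n)                   # noise covariance of the full model

def kl_to_disconnected(params):
    """KL[p(x,y) : q(x,y)] when q(x) = p(x); closed form for Gaussians."""
    Ad = np.diag(params[:n])          # diagonal A' of the disconnected model
    Sq = np.diag(np.exp(params[n:]))  # diagonal noise covariance, kept positive
    # covariance of the residual y - A'x under the full model p
    R = (A - Ad) @ Sx @ (A - Ad).T + S
    return 0.5 * (np.log(np.linalg.det(Sq) / np.linalg.det(S))
                  + np.trace(np.linalg.inv(Sq) @ R) - n)

x0 = np.concatenate([np.diag(A), np.log(np.diag(S))])
res = minimize(kl_to_disconnected, x0, method="Nelder-Mead")
print("Phi_geo ~", res.fun)
```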
Many other definitions of Φ
Full model and disconnected models
Full model $S_F$: graphical model
$p(\boldsymbol x, \boldsymbol y) = \exp\left\{ \sum \theta^X_i x_i + \sum \theta^Y_i y_i + \sum \theta^{XY}_{ij} x_i y_j + \text{higher-order terms} \right\}$
exponential family: $\theta$-coordinates and $\eta$-coordinates, $\quad \eta^X_i = E[x_i], \ \ \eta^{XY}_{ij} = E[x_i y_j], \dots$
There are many disconnected models!!
Split Model $S_H$: Ay, Barrett & Seth
$S_H = \{\, q : q(\boldsymbol x, \boldsymbol y) = q(\boldsymbol x) \prod_i q(y_i \mid x_i) \,\}, \qquad \theta^{XY}_{12} = \theta^{XY}_{21} = \theta^{Y}_{12} = 0$
$\Phi_H = \min_{q \in S_H} D_{KL}[p : q]$
minimizer: $\hat q(\boldsymbol x) = p(\boldsymbol x), \qquad \hat q(y_i \mid x_i) = p(y_i \mid x_i)$
$\Phi_H = \sum_i H(Y_i \mid X_i) - H(\boldsymbol Y \mid \boldsymbol X)$
[Figure: split model keeping only the branches $X_1 \to Y_1$ and $X_2 \to Y_2$]
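As a sanity check of the closed form $\Phi_H = \sum_i H(Y_i \mid X_i) - H(\boldsymbol Y \mid \boldsymbol X)$ obtained from the minimizer above, here is a small sketch evaluating it on an explicit joint table; the random joint distribution is purely illustrative.

```python
# Small sketch: Phi_H = sum_i H(y_i|x_i) - H(y|x), evaluated on an explicit
# joint table p[x1, x2, y1, y2] with binary units. The joint is random,
# purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2, 2))
p /= p.sum()

def entropy(t):
    """Shannon entropy (bits) of a probability table."""
    t = t[t > 0]
    return -np.sum(t * np.log2(t))

# conditional entropies via the chain rule H(A|B) = H(A,B) - H(B)
H_y1_x1 = entropy(p.sum(axis=(1, 3))) - entropy(p.sum(axis=(1, 2, 3)))
H_y2_x2 = entropy(p.sum(axis=(0, 2))) - entropy(p.sum(axis=(0, 2, 3)))
H_y_x = entropy(p) - entropy(p.sum(axis=(2, 3)))

print("Phi_H =", H_y1_x1 + H_y2_x2 - H_y_x, "bits")  # always >= 0
```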
Mixed Coordinates:
$\boldsymbol\zeta = (\,\eta\text{-part};\ \theta^{XY}_{12}, \theta^{XY}_{21}, \theta^{Y}_{12}\,)$
$S_H:\ \theta^{XY}_{12} = \theta^{XY}_{21} = \theta^{Y}_{12} = 0$
projection: $\hat q$ keeps the $\eta$-part of $p$, $\quad \hat q \in S_H$
Markovian Condition
$Y_1 \perp X_2 \mid X_1; \qquad Y_2 \perp X_1 \mid X_2$
$p_{ind} = p_X\, p_Y \Rightarrow \Phi = 0$; $\quad \Phi > 0$ when $I(\boldsymbol x ; \boldsymbol y) > 0$
problem: the case $I(X ; Y) = 0$
Split Model $S_{Gr}$: graphical model
cross branches deleted: $\theta^{XY}_{12} = \theta^{XY}_{21} = 0$
projection: $\hat q(\boldsymbol x) = p(\boldsymbol x), \qquad \hat q(y_i \mid x_i) = p(y_i \mid x_i)$
$\Phi_{Gr} = 0$ when $I(X ; Y) = 0$
Problem: Gaussian channel
$p:\ \boldsymbol y = A \boldsymbol x + \boldsymbol e$; $\quad$ projection $\hat q:\ \boldsymbol y = \hat A \boldsymbol x + \hat{\boldsymbol e}, \ \hat q \in S_{Gr}$
$A$ not diagonal $\Rightarrow \theta^{XY}_{12}, \theta^{XY}_{21} \neq 0$
$p(\boldsymbol x, \boldsymbol y) = \frac{1}{Z} \exp\left\{ -\frac{1}{2}\, \boldsymbol x^T \Sigma_X^{-1} \boldsymbol x - \frac{1}{2}\, (\boldsymbol y - A \boldsymbol x)^T \Sigma^{-1} (\boldsymbol y - A \boldsymbol x) \right\}$
Mismatched Decoding Model $S_M$
best mismatched decoding from $\boldsymbol y$ to $\hat{\boldsymbol x}$
$S_M = \{\, q(\boldsymbol x, \boldsymbol y) = p(\boldsymbol x)\, p(\boldsymbol y) \prod_i p(y_i \mid x_i) \,\}$
$\Phi^* = \min_{q \in S_M} D_{KL}[p : q]$
$S_H,\ S_{Gr}$: dually flat; $\quad S_M,\ S_{geo}$: not flat
$S_H \subset S_{Gr} \subset S_M$
$\Phi_{geo} \le \Phi_{Gr} \le \Phi_H$
Transfer Entropy; Granger causality
$TE[x_i \to y_j] = \min_q D_{KL}[p : q], \qquad q$: branch $(i \to j)$ disconnected
$TE = H[Y_j \mid X_j] - H[Y_j \mid X_i, X_j]$
Non-additive: $TE[x_i, x_k \to y_j, y_m] \neq TE[x_i \to y_j] + TE[x_k \to y_m]$
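Following the same table-based approach as above, a minimal sketch of the transfer entropy $TE = H(Y_j \mid X_j) - H(Y_j \mid X_i, X_j)$; the joint table is again random and purely illustrative.

```python
# Minimal sketch: transfer entropy TE(x1 -> y2) = H(y2|x2) - H(y2|x1, x2)
# from an explicit joint table p[x1, x2, y2]; the joint is illustrative.
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2))
p /= p.sum()

def entropy(t):
    t = t[t > 0]
    return -np.sum(t * np.log2(t))

# H(A|B) = H(A,B) - H(B)
H_y2_given_x2 = entropy(p.sum(axis=0)) - entropy(p.sum(axis=(0, 2)))
H_y2_given_x1x2 = entropy(p) - entropy(p.sum(axis=2))
print("TE(x1 -> y2) =", H_y2_given_x2 - H_y2_given_x1x2, "bits")
```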
Hierarchy: transfer entropy
partitions $X = (X_i, X_j), \qquad Y = (Y_i, Y_j)$
cutting branches: split models
partition of $X \times Y$; subadditivity
Information Geometry of the Hyvärinen Game Score
Following Grünwald, Dawid, Parry, Lauritzen, Hyvärinen
score $S(x, q)$; log score: $l(x, q) = -\log q(x)$
loss: $L(p, q) = E_p[\,S(x, q)\,]$
$S$-entropy: $H_S(p) = E_p[\,S(x, p)\,]$
$S$-divergence: $D_S[p : q] = H_S(p, q) - H_S(p, p)$, where $H_S(p, q) = E_p[\,S(x, q)\,]$
Hyvärinen score
$S(x, q) = \frac{1}{2}\, l'(x)^2 + l''(x), \qquad l(x) = \log q(x)$
$D_S[p : q] = \frac{1}{2}\, E_p\!\left[ \left( \frac{d}{dx} \log p(x) - \frac{d}{dx} \log q(x) \right)^{\!2} \right]$
$D_S[p : cq] = D_S[p : q]$ : normalization unnecessary
parametric case $M = \{\, q(x, \boldsymbol\theta) \,\}$: $\quad \hat{\boldsymbol\theta} = \arg\min_{\boldsymbol\theta} D_S[p : q(x, \boldsymbol\theta)]$
$\boldsymbol s(x, \boldsymbol\theta) = \partial_{\boldsymbol\theta}\, S(x, q(x, \boldsymbol\theta))$ : estimating function
$\sum_{i=1}^N \boldsymbol s(x_i, \boldsymbol\theta) = 0$ : estimating equation
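To make the estimating equation concrete, here is a minimal sketch of the Hyvärinen estimator for a 1-D Gaussian model, where $S(x, q) = (x - \mu)^2 / (2v^2) - 1/v$ with $v = \sigma^2$; the data and optimizer choice are illustrative assumptions. For Gaussian $q$ the minimizer coincides with the MLE, consistent with the efficiency statement below.

```python
# Minimal sketch of score matching with the Hyvarinen score for a 1-D
# Gaussian model q(x; mu, v), v = sigma^2. Here S(x, q) = (1/2) l'(x)^2
# + l''(x) with l = log q, i.e. (x - mu)^2 / (2 v^2) - 1/v. Data and the
# optimizer choice are illustrative, not from the talk.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=5000)

def empirical_score(params):
    mu, log_v = params
    v = np.exp(log_v)                 # variance, kept positive
    l1 = -(data - mu) / v             # l'(x)
    l2 = -1.0 / v                     # l''(x)
    return np.mean(0.5 * l1**2 + l2)  # empirical Hyvarinen score

res = minimize(empirical_score, x0=[0.0, 0.0])
mu_hat, v_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, v_hat)                  # ~ sample mean and variance (the MLE)
```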
Information geometry of $D_S[p : q]$
Asymptotic analysis of the estimator:
$E[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^T] = \frac{1}{N}\, K^{-1} V K^{-T} \ \ge\ \frac{1}{N}\, G^{-1}$
$K = E[\,\partial_{\boldsymbol\theta}\, \boldsymbol s(x, \boldsymbol\theta)\,], \qquad V = E[\,\boldsymbol s(x, \boldsymbol\theta)\, \boldsymbol s(x, \boldsymbol\theta)^T\,], \qquad G$: Fisher information
$\boldsymbol s(x, \boldsymbol\theta) = c\, \partial_{\boldsymbol\theta} \log q(x, \boldsymbol\theta) + \boldsymbol a(x, \boldsymbol\theta)$
$E[(\hat{\boldsymbol\theta} - \boldsymbol\theta)(\hat{\boldsymbol\theta} - \boldsymbol\theta)^T] = \frac{1}{N}\, G^{-1} A\, G^{-1}, \qquad A = E[\,\boldsymbol a(x, \boldsymbol\theta)\, \boldsymbol a(x, \boldsymbol\theta)^T\,]$
efficient when $\boldsymbol s \propto \partial_{\boldsymbol\theta} \log q$
Hyvärinen estimator: Fisher efficient $\iff q$ is multivariate Gaussian
Discrete case: $\boldsymbol x$ on graph nodes $x_1, \dots, x_N$
difference operator in place of the derivative: $\tilde\partial f(\boldsymbol x) = f(\boldsymbol x') - f(\boldsymbol x)$, $\quad \boldsymbol x'$: neighboring node
score $S(\boldsymbol x, q)$ and divergence $D[q_1 : q_2]$ defined as in the continuous case, with the inner product $\langle f, h \rangle = \int f(x)\, h(x)\, dx$ replaced by $\sum_{\boldsymbol x} f(\boldsymbol x)\, h(\boldsymbol x) \ge 0$
Deep Learning
Self-organization + supervised learning
RBM: Restricted Boltzmann Machine; auto-encoder, recurrent net
dropout, contrastive divergence, convolution
Mathematical Neurons
$y = \varphi\left( \sum_i w_i x_i - h \right) = \varphi(\boldsymbol w \cdot \boldsymbol x - h)$
activation function $\varphi(u)$
Multilayer Perceptrons
$y = \sum_i v_i\, \varphi(\boldsymbol w_i \cdot \boldsymbol x) = f(\boldsymbol x, \boldsymbol\theta)$
$\boldsymbol x = (x_1, x_2, \dots, x_n), \qquad \boldsymbol\theta = (\boldsymbol w_1, \dots, \boldsymbol w_m;\ v_1, \dots, v_m)$
Multilayer Perceptron
$y = f(\boldsymbol x; \boldsymbol\theta) = \sum_i v_i\, \varphi(\boldsymbol w_i \cdot \boldsymbol x), \qquad \boldsymbol\theta = (\boldsymbol w_1, \dots, \boldsymbol w_m;\ v_1, \dots, v_m)$
neuromanifold: $\{\, f(\boldsymbol x; \boldsymbol\theta) \,\}$, a subspace of the space of functions $S$
Backpropagation: stochastic gradient learning
examples $(\boldsymbol x_1, y_1), (\boldsymbol x_2, y_2), \dots$ : training set
$l(y, \boldsymbol x; \boldsymbol\theta) = \frac{1}{2}\left( y - f(\boldsymbol x, \boldsymbol\theta) \right)^2 = -\log p(y \mid \boldsymbol x; \boldsymbol\theta)$
$\boldsymbol\theta_{t+1} = \boldsymbol\theta_t - \eta_t\, \nabla l(y_t, \boldsymbol x_t; \boldsymbol\theta_t)$
singularities
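A minimal numpy sketch of this vanilla stochastic gradient (backpropagation) rule for the perceptron $f(\boldsymbol x, \boldsymbol\theta) = \sum_i v_i\, \varphi(\boldsymbol w_i \cdot \boldsymbol x)$; the tanh activation, toy teacher signal, and sizes are illustrative assumptions, not from the talk.

```python
# Minimal sketch: one-hidden-layer perceptron f(x) = sum_i v_i * phi(w_i . x)
# trained by vanilla stochastic gradient descent on l = (1/2)(y - f(x))^2.
# Activation, teacher signal and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                                   # input dim, hidden units
W = 0.5 * rng.normal(size=(m, n))             # hidden weight vectors w_i
v = 0.5 * rng.normal(size=m)                  # output weights v_i
phi = np.tanh
dphi = lambda u: 1.0 - np.tanh(u) ** 2

eta = 0.05                                    # learning rate
for _ in range(5000):                         # stochastic gradient learning
    x = rng.normal(size=n)
    y = np.sin(x.sum())                       # teacher signal (toy target)
    u = W @ x                                 # hidden pre-activations
    err = v @ phi(u) - y                      # dl/df = f(x) - y
    grad_v = err * phi(u)                     # dl/dv_i
    grad_W = err * (v * dphi(u))[:, None] * x[None, :]   # backprop to w_i
    v -= eta * grad_v
    W -= eta * grad_W
print("last-sample squared error:", 0.5 * err**2)
```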
Geometry of singular models
$y = v\, \varphi(\boldsymbol w \cdot \boldsymbol x) + n$; singular where $v = 0$ or $|\boldsymbol w| = 0$
model with 2 hidden neurons:
$y = f(\boldsymbol x) + n, \qquad f(\boldsymbol x) = w_1\, \varphi(\boldsymbol J_1 \cdot \boldsymbol x) + w_2\, \varphi(\boldsymbol J_2 \cdot \boldsymbol x)$
$\varphi(u) = \frac{1}{\sqrt{2\pi}} \int_0^u e^{-t^2/2}\, dt$
loss function: $l(y, \boldsymbol x; \boldsymbol\theta) = \frac{1}{2}\left( y - f(\boldsymbol x, \boldsymbol\theta) \right)^2, \qquad y$: teacher signal
stochastic descent learning; backprop: vanilla gradient
$\boldsymbol\theta_{t+1} = \boldsymbol\theta_t - \eta_t\, \nabla l(y_t, \boldsymbol x_t; \boldsymbol\theta_t)$
Natural Gradient Stochastic Descent
$\boldsymbol\theta_{t+1} = \boldsymbol\theta_t - \eta_t\, G^{-1}\, \nabla l(y_t, \boldsymbol x_t; \boldsymbol\theta_t)$
$G = E[\,\nabla l\, \nabla l^T\,]$ : Fisher information matrix
invariant; steepest descent
Natural gradient is superior
steepest descent; invariant (Yann Ollivier)
Fisher-efficient
natural gradient is non-vanishing even in multiple layers
good at singular regions (avoids plateaus: Milnor attractor)
adaptive natural gradient
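A minimal sketch of the natural gradient update $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\, G^{-1} \nabla l$ for the simplest case of a 1-D Gaussian model, where the Fisher matrix $G = \mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$ in $(\mu, \sigma)$ coordinates is known analytically; the data and step size are illustrative assumptions.

```python
# Minimal sketch of natural gradient descent theta <- theta - eta G^{-1} grad l
# for a 1-D Gaussian model p(x; mu, sigma), whose Fisher information matrix
# G = diag(1/sigma^2, 2/sigma^2) in (mu, sigma) coordinates is analytic.
# Data and step size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 2.0, size=1000)

def grad_nll(mu, sigma):
    """Gradient of the average negative log-likelihood of N(mu, sigma^2)."""
    g_mu = -(data - mu).mean() / sigma**2
    g_sigma = 1.0 / sigma - ((data - mu) ** 2).mean() / sigma**3
    return np.array([g_mu, g_sigma])

theta = np.array([0.0, 1.0])                  # (mu, sigma) initial guess
for _ in range(200):
    g = grad_nll(*theta)
    G_inv = np.diag([theta[1]**2, theta[1]**2 / 2.0])   # inverse Fisher
    theta = theta - 0.1 * G_inv @ g           # natural gradient step
print(theta)                                  # ~ (3.0, 2.0)
```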
Singular Region in Parameter Space
$f(\boldsymbol x) = w_1\, \varphi(\boldsymbol J_1 \cdot \boldsymbol x) + w_2\, \varphi(\boldsymbol J_2 \cdot \boldsymbol x)$
$R(w, \boldsymbol J)$: $\ \boldsymbol J_1 = \boldsymbol J_2 = \boldsymbol J, \quad w_1 + w_2 = w$
also $(0, w, \boldsymbol J_1, \boldsymbol J)$ and $(w, 0, \boldsymbol J, \boldsymbol J_2)$: one output weight vanishes, the corresponding $\boldsymbol J_i$ arbitrary
Coordinate transformation
$w = w_1 + w_2, \qquad z = \frac{w_2 - w_1}{w_1 + w_2}$
$\boldsymbol v = \frac{w_1 \boldsymbol J_1 + w_2 \boldsymbol J_2}{w_1 + w_2}, \qquad \boldsymbol u = \boldsymbol J_2 - \boldsymbol J_1$
$(w_1, w_2, \boldsymbol J_1, \boldsymbol J_2) \ \leftrightarrow\ (w, z, \boldsymbol v, \boldsymbol u)$
Singular region $R$: $\ \boldsymbol u = 0, \ |z| \le 1$
Milnor attractor
[Fig. 2: trajectories]
saddle and plateau
topology of the singular region $R$
blow-down coordinates:
$c_1 = (1 - z^2)\, u^2, \qquad c_2 = (1 - z^2)\, z\, u^3$
$\boldsymbol e = \boldsymbol u / |\boldsymbol u|, \qquad \boldsymbol e \in S^{n-1}$
singular region $R$: $\ \boldsymbol u = 0, \ |z| \le 1$
Sphere $S^n$ and projective space $P^n$
natural gradient learning near the singularity:
true model $\in R$: $\ \frac{d\boldsymbol\theta}{dt} \to R$ (Milnor attractor)
true model $\notin R$: $\ \frac{d\boldsymbol\theta}{dt} = O(1)$
Canonical Divergence in a Manifold of Dual Affine Connections
Nihat Ay and S. Amari
Divergence and metric
$D[p : q] \ge 0, \qquad = 0 \iff p = q$
$D[\boldsymbol\xi : \boldsymbol\xi + d\boldsymbol\xi] = \frac{1}{2} \sum g_{ij}\, d\xi^i\, d\xi^j + O(|d\boldsymbol\xi|^3)$
$G = (g_{ij})$: Riemannian metric, positive definite
Divergence and dual affine connections
$g_{ij} = -\,\partial_i\, \partial'_j\, D\big|_{q=p}$
$\Gamma_{ijk} = -\,\partial_i \partial_j\, \partial'_k\, D\big|_{q=p}, \qquad \Gamma^*_{ijk} = -\,\partial'_i \partial'_j\, \partial_k\, D\big|_{q=p}$
($\partial$: derivative in the first argument $p$; $\partial'$: in the second argument $q$)
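As a standard worked instance (a well-known fact, not specific to this talk): taking $D = D_{KL}$ in the relations above yields
$g_{ij} = -\,\partial_i \partial'_j\, D_{KL}\big|_{q=p} = E_p[\,\partial_i \log p\ \partial_j \log p\,]$ : the Fisher metric,
and the induced pair $(\nabla, \nabla^*)$ are the mixture (m-) and exponential (e-) connections.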
Dual geometry
$(M, g, \nabla, \nabla^*)$: $\quad X\langle Y, Z\rangle = \langle \nabla_X Y,\, Z\rangle + \langle Y,\, \nabla^*_X Z\rangle$
$(M, g, T)$: $\quad \Gamma_{ijk} = \Gamma^0_{ijk} - \frac{1}{2} T_{ijk}, \qquad \Gamma^*_{ijk} = \Gamma^0_{ijk} + \frac{1}{2} T_{ijk}$
$\Gamma^0$: Levi-Civita connection
Dual geometry and the canonical divergence
$M$ dually flat: coordinates $(\boldsymbol\theta, \boldsymbol\eta)$, potentials $(\psi, \varphi)$
canonical divergence $D$: Bregman divergence
$D[p : q] = \psi(\boldsymbol\theta_p) + \varphi(\boldsymbol\eta_q) - \boldsymbol\theta_p \cdot \boldsymbol\eta_q$
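As a worked instance (standard, not specific to this talk): for an exponential family $p(x; \boldsymbol\theta) = \exp\{\boldsymbol\theta \cdot \boldsymbol x - \psi(\boldsymbol\theta)\}$ with $\boldsymbol\eta = \nabla\psi(\boldsymbol\theta)$ and $\varphi$ the Legendre dual of $\psi$, the Bregman form evaluates to
$D[p : q] = \psi(\boldsymbol\theta_p) + \varphi(\boldsymbol\eta_q) - \boldsymbol\theta_p \cdot \boldsymbol\eta_q = D_{KL}[q : p]$,
so the canonical divergence of the dually flat structure reproduces the KL divergence (with the argument order of this convention).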
Exponential map: geodesic $\sigma(t)$ with $\sigma(0) = p$, $\sigma(1) = q$
$X_p(q) = \dot\sigma(0) \in T_p M$ : inverse exponential map (initial velocity of the geodesic from $p$ to $q$); e.g. $X_p(q) = \log q - \log p$ for the exponential connection
Exponential map divergence: $D[p : q] = \frac{1}{2}\, \| X_q(p) \|^2$
$\nabla$-divergence: the same construction along the $\nabla$-geodesic
Theorem 1. The exponential map divergence induces the $\tfrac{1}{3}$-geometry.
Standard divergence: $D_{stan}[p : q] = \frac{1}{2}\, \| X^{(1/3)}_q(p) \|^2$
Theorem 2. The exponential map divergence recovers the original geometry:
$D[p : q] = \int_0^1 \big\langle X_{\xi_t}(p),\ \dot\xi(t) \big\rangle\, dt, \qquad \xi_t = \xi_t(q, p)$ : geodesic
$\ = \int_0^1 t\, \| \dot\xi(t) \|^2\, dt, \qquad$ weighted form: $\int_0^1 w(t)\, \| \dot\xi(t) \|^2\, dt$
Divergence and projection
projection theorem: $\hat q = \arg\min_{q \in S} D[p : q]$
$\operatorname{grad}_q D[p : q] = c\, X_q(p)$
IEEE ISIT-2011, Sankt Petersburg
Data Compression in Multiterminal Statistical Inference
Shun-ichi Amari, RIKEN Brain Science Institute
A long-standing problem: T. Berger; Csiszár, Ahlswede, Burnashev, Han, Amari
correlated sources $X$, $Y$: data compression and statistical inference
$X = x_1 x_2 \cdots x_n \ \to\ k_X$ bits
$Y = y_1 y_2 \cdots y_n \ \to\ k_Y$ bits
$p(x, y; q)$, iid
binary: $x, y \in \{0, 1\}$; $\ \operatorname{Prob}\{x = 1\} = \operatorname{Prob}\{y = 1\} = \frac{1}{2}$; $\ \operatorname{Prob}\{x \neq y\} = q$
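One orientation point (a standard reduction, not from the slides): under full observation the per-pair statistic $z_i = x_i \oplus y_i$ is Bernoulli($q$), so $n$ pairs carry Fisher information $n / (q(1-q))$ about $q$; the multiterminal question is how much of this survives once $X$ and $Y$ are compressed to $k_X$ and $k_Y$ bits.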
Encoding: data compression
$\boldsymbol x \in X^n \ \mapsto\ c(\boldsymbol x)$: $\ 2^n$ sequences $\to 2^{k_X}$ codewords ($n$ bits $\to k_X$ bits)
One-bit helper case: $k_X = 1$, $k_Y = n$
$c = \operatorname{sgn}(\boldsymbol a \cdot \boldsymbol x)$
$X = x_1 x_2 \cdots x_n \ \to\ c$ : 1 bit; $\qquad Y = y_1 y_2 \cdots y_n$ : $n$ bits
Is single-bit encoding optimal? It is optimal when $q = 1/2$ ($\boldsymbol x$, $\boldsymbol y$ independent), but not for general $q$.
Fisher information: $k_X = 1$, $k_Y = n$
Kingo Kobayashi: parity encoding, $c = x_1 \oplus x_2 \oplus \cdots \oplus x_s$
in progress!
Information Geometry and the Transportation Problem (Wasserstein distance)
entropic relaxation: $\min_P\ \langle C, P \rangle - \lambda H(P)$, and its dual
New paper: S. Pal and T.-K. L. Wong, "Exponentially concave functions and a new information geometry"
portfolio theory, the transportation problem and information geometry (dually projectively flat)
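The entropic relaxation above is commonly solved by Sinkhorn fixed-point iterations on the dual scalings; a minimal sketch, in which the cost matrix, marginals, and regularization strength are illustrative assumptions, not from the talk:

```python
# Minimal Sinkhorn sketch for the entropic relaxation min <C, P> - lam * H(P)
# over couplings P with fixed marginals p and q. All inputs are illustrative.
import numpy as np

n = 5
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :]) ** 2      # squared-distance cost matrix
p = np.full(n, 1.0 / n)                 # source marginal
q = np.full(n, 1.0 / n)                 # target marginal
lam = 0.05                              # entropic regularization strength

K = np.exp(-C / lam)                    # Gibbs kernel
u = np.ones(n)
for _ in range(500):                    # Sinkhorn fixed-point iterations
    v = q / (K.T @ u)
    u = p / (K @ v)
P = u[:, None] * K * v[None, :]         # (approximately) optimal coupling
print("transport cost:", np.sum(P * C))
```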