Mathematics for Artificial Intelligence
Reading Course
Elena Agliari Dipartimento di Matematica Sapienza Università di Roma
(TENTATIVE) PLAN OF THE COURSE

Introduction
Chapter 1: Basics of statistical mechanics. The Curie-Weiss model
Chapter 2: Neural networks for associative memory and pattern recognition
Chapter 3: The Hopfield model
  - Hopfield model with low load and solution via log-constrained entropy
  - Hopfield model with high load and solution via stochastic stability
Chapter 4: Beyond the Hebbian paradigm
Chapter 5: A gentle introduction to machine learning
  - Maximum likelihood
  - Rosenblatt and Minsky & Papert perceptrons
Chapter 6: Neural networks for statistical learning and feature discovery
  - Supervised Boltzmann machines
  - Bayesian equivalence between Hopfield retrieval and Boltzmann learning
Chapter 7: A few remarks on unsupervised learning, “complex” patterns, deep learning
  - Unsupervised Boltzmann machines
  - Non-Gaussian priors
  - Multilayered Boltzmann machines and deep learning

Seminars: Numerical tools for machine learning; non-mean-field neural networks; (bio-)logic gates; maximum entropy approach; Hamilton-Jacobi techniques for mean-field models; …
Introduction
20 October ‘17
The “founding fathers”
Ivan Pavlov (1849-1936)
Santiago Ramón y Cajal (1852-1934)
Donald Hebb (1904-1985)
Alan L. Hodgkin (1914-1998)
Andrew Huxley (1917-2012)
Marvin L. Minsky (1927-2016)
Seymour Papert (1928-2016)
Frank Rosenblatt (1928-1971)
Pavlov’s experiment: storing correlations between stimuli
The Hodgkin–Huxley model, or conductance-based model, is a mathematical model that describes how action potentials in neurons are initiated and propagated.
Rosenblatt’s perceptron: probabilistic model for information storage and organization in the brain
Minsky & Papert showed that a single-layer net can learn only linearly separable problems
Geoffrey Hinton
John Hopfield
Giorgio Parisi
Statistical-mechanics approach
nonlinear, over-percolated, with feedback
- cells (‘neurons’) in brains or in other nervous tissue
- electronic processors (or even software) in artificial systems inspired by biological neural networks
Principles behind information processing in complex networks of simple interacting decision-making units
Artificial intelligence is intelligence exhibited by machines
It is modeled on the mechanisms of human cognition
Learn / Retrieve

Cognition = Learning + Retrieval
Study the two mechanisms separately, under the adiabatic hypothesis: τR ~ O(ms) is much shorter than τL ~ O(days/weeks)
Disciplines are in fact specialized in either learning or retrieval:
- Engineers: machine learning ➛ speech and object recognition, robot locomotion, computer vision
- Mathematicians and theoretical physicists: rigorous results concerning the retrieval of a machine’s stored data, under different conditions
References:
- D.J. Amit (1992), Modeling Brain Function, Cambridge University Press.
- H.C. Tuckwell (2005), Introduction to Theoretical Neurobiology, Cambridge University Press.
- F. Rosenblatt (1958), “The perceptron: a probabilistic model for information storage and organization in the brain”, Psychological Review.
- A.C.C. Coolen, R. Kühn, P. Sollich (2005), Theory of Neural Information Processing Systems, Oxford University Press.
Retrieval
The network is asked to recognize previously learned input vectors, even in the case where some noise has been added.
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart): challenge-response test used in computing to determine whether or not the user is human.
Examples
1 "To be or not to be, that is _____."
2 "I came, I saw, _____."

Most readers will realize the missing information is in fact:

1 "To be or not to be, that is the question."
2 "I came, I saw, I conquered."
Associative memory
11
/150
- Content-addressable memory
In the von Neumann model of computation, an entry in memory is accessed only through its address, which is independent of the content in the memory. If a small error is made in calculating the address, a completely different item can be retrieved.
Associative memory, or content-addressable memory, is accessed by content: an item can be recalled even from a partial or distorted input. Associative memory is extremely desirable in building multimedia information databases.
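As an illustration, content-addressable recall can be reduced to nearest-neighbour matching in Hamming distance: the probe retrieves whichever stored item it most resembles, rather than an item at a fixed address. A minimal sketch (function names and patterns are invented for the example):

```python
# Content-addressable recall by nearest-neighbour (Hamming) matching.
# All names and patterns here are illustrative, not from the lecture notes.

def hamming(a, b):
    """Number of positions where two +/-1 tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def recall(memory, probe):
    """Return the stored pattern closest to the probe."""
    return min(memory, key=lambda stored: hamming(stored, probe))

memory = [(1, 1, 1, 1, -1, -1), (-1, 1, -1, 1, -1, 1)]
noisy = (1, 1, 1, -1, -1, -1)    # first pattern with one flipped entry
print(recall(memory, noisy))     # -> (1, 1, 1, 1, -1, -1)
```

Note that the probe is never an "address": retrieval succeeds despite the flipped entry because the intact entries still single out the nearest stored pattern.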
Learning
A machine learning algorithm is an algorithm that is able to learn from data
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E
To learn complex tasks we need “dynamic algorithms”
Examples of tasks
- Classification (possibly with missing inputs): the program is asked to specify which of k categories some inputs belong to
the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command
- Transcription: observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form
a photograph containing an image of text is translated into a sequence of characters (e.g., in ASCII or Unicode format).
Google Street View processes address numbers in this way
- Anomaly detection: the program sifts through a set of events or objects, and flags some of them as being unusual or atypical
credit card companies can prevent fraud by placing a hold on an account as soon as the card has been used for an uncharacteristic purchase (a thief’s purchases often come from a different probability distribution)
- Synthesis and sampling: the program is asked to generate new examples that are similar to those in the training data
video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel
- Prediction/Forecasting: given a set of n samples in a time series, the task is to predict the sample at some future time.
impact on decision-making in business, science and engineering. Stock market prediction and weather forecasting are typical applications
- Function approximation: the task is to find an estimate of an unknown function, given a set of input-output examples.
Inference is required whenever a model is at work
- Optimization: find a solution satisfying a set of constraints such that an objective function is minimized or maximized.
the Traveling Salesman Problem, which is an NP-complete problem
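To make the optimization task concrete, here is a brute-force sketch of a tiny TSP instance. The city names and distances are invented for illustration; real instances need heuristics, since exhaustive search scales factorially.

```python
# Brute-force TSP on a toy instance: minimize the length of a closed tour
# visiting every city exactly once.  Distances are made up for illustration.
from itertools import permutations

dist = {('A', 'B'): 2, ('A', 'C'): 9, ('A', 'D'): 10,
        ('B', 'C'): 6, ('B', 'D'): 4, ('C', 'D'): 3}

def d(u, v):
    # symmetric lookup: the table stores each pair once
    return dist[(u, v)] if (u, v) in dist else dist[(v, u)]

def tour_length(order):
    # closed tour: return to the starting city at the end
    return sum(d(order[i], order[(i + 1) % len(order)]) for i in range(len(order)))

cities = ['A', 'B', 'C', 'D']
best = min(permutations(cities), key=tour_length)
print(best, tour_length(best))   # shortest closed tour has length 18
```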
Chapter I Basics of equilibrium statistical mechanics
Thermodynamic equilibrium states
Description levels: microscopic → mesoscopic → macroscopic
(microscopic → mesoscopic via coarse graining; mesoscopic → macroscopic via averaging over the proper degrees of freedom)
Statistical ensemble: the thermodynamic state of a system is described through a probability measure over the mesoscopic states of the system; the corresponding thermodynamic observables are expressed through proper averages.

Let T be the space of mesoscopic states (within the phase space Γ). For each state i ∈ T consider the energy Ei which, in general, depends on external parameters.

Thermodynamic states (not necessarily of equilibrium) are described by probability distributions ρ = {ρi}i∈T over T. Denoting by S the set of thermodynamic states,

ρ ∈ S ⇒ ρ = {ρi}i∈T : ρi ≥ 0, ∑i ρi = 1.

Defs.
Internal energy:  U(E, ρ) = ∑i ρi Ei
Entropy:          S(ρ) = -K ∑i ρi log ρi
Free energy:      F(E, ρ, T) = U(E, ρ) - T S(ρ)    (strictly convex in S)

Equilibrium state, or Boltzmann distribution: the state where F reaches its minimum,

ρ̄i = (1/Z) e^{-Ei/KT},    Z = ∑i e^{-Ei/KT} = e^{-F̄(E,T)/KT}    (partition function)

F̄(E, T) = inf_ρ F(E, ρ, T) ≡ F(E, ρ̄, T)    (Helmholtz free energy)

The parameter T can be seen as the inverse of a Lagrange multiplier: MaxEnt with U = Ū.
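The variational characterization above (the Boltzmann distribution minimizes F = U - TS, with minimum value -T log Z) can be checked numerically. A sketch with K = 1 and arbitrary, invented energy levels:

```python
# Numerical check (K = 1) that the Boltzmann distribution minimizes
# F(rho) = U(rho) - T*S(rho) over probability vectors rho, with
# minimum value -T*log(Z).  Energy levels are invented for the example.
import math, random

E = [0.0, 1.0, 2.5, 3.0]   # arbitrary energy levels
T = 1.3

def free_energy(rho):
    U = sum(r * e for r, e in zip(rho, E))
    S = -sum(r * math.log(r) for r in rho if r > 0)
    return U - T * S

Z = sum(math.exp(-e / T) for e in E)
boltzmann = [math.exp(-e / T) / Z for e in E]
F_eq = free_energy(boltzmann)

# the minimum equals -T log Z ...
assert abs(F_eq + T * math.log(Z)) < 1e-12

# ... and no random distribution does better
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in E]
    rho = [x / sum(w) for x in w]
    assert free_energy(rho) >= F_eq - 1e-12

print("F_min =", F_eq)
```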
Theorem. Hp: the dynamic evolution of the system is described by an ergodic, stationary Markov process. Then the second principle of thermodynamics (entropy is non-decreasing) holds iff the Boltzmann state is the invariant state of the ergodic process.

The free energy F = -T log Z = U - TS bridges micro and macro: its derivatives provide the averages of thermodynamic observables.
The Curie-Weiss model
The model appeared for the first time in the physics literature as a model for ferromagnets (a ferromagnet is a material that acquires a macroscopic spontaneous magnetization at low temperature). In this context, the variables σi are called spins and their value represents the direction in which a localized magnetic moment (think of a tiny compass needle) is pointing. In some materials different magnetic moments like to point in the same direction (as people like to have similar opinions). Physicists want to understand whether this interaction might lead to a macroscopic magnetization (imbalance), or not.
Unordered state: no magnetization. Ordered state: non-null magnetization.
H({σ}, J, h) = -(1/N) ∑_{(i,j): i∼j} Jij σi σj - ∑_{i∈V} hi σi

on a graph G = (V, E), with σi = ±1, i = 1, …, |V| ≡ N; J is the coupling matrix and h the external magnetic field.

Let us check it makes sense…
- If Jij > 0 (Jij < 0), the configuration where i and j are aligned (misaligned) is energetically more favorable.
- The i-th spin tends to get aligned with hi.
- H is linearly extensive in the system volume.
H({σ}, J, h) = -(1/N) ∑_{(i,j): i∼j} Jij σi σj - ∑_{i∈V} hi σi

on a graph G = (V, E), with σi = ±1, i = 1, …, |V| ≡ N, coupling matrix J and external magnetic field h.

Hp: homogeneity

H({σ}, J, h) = -(J/N) ∑_{(i,j): i∼j} σi σj - h ∑_{i∈V} σi

Hp: mean field

H({σ}, J, h) = -(J/N) ∑_{i=1}^{N} ∑_{j<i} σi σj - h ∑_{i=1}^{N} σi
             = -(J/2N) ∑_{i,j=1}^{N} σi σj - h ∑_{i=1}^{N} σi + O(1)
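Under the homogeneity and mean-field hypotheses the Hamiltonian depends on the configuration only through the magnetization m, which makes it cheap to evaluate. A sketch (the values of J, h and the spin configuration are illustrative):

```python
# Mean-field (Curie-Weiss) Hamiltonian for N spins:
#   H = -(J/2N) * sum_{i,j} s_i s_j - h * sum_i s_i
#     = -(J*N/2) * m^2 - h*N*m,   with m = (1/N) sum_i s_i.
# A sketch; spins take values +/-1.

def cw_energy(spins, J=1.0, h=0.0):
    N = len(spins)
    m = sum(spins) / N          # magnetization: the only thing H depends on
    return -(J * N / 2) * m**2 - h * N * m

spins = [+1] * 6 + [-1] * 2            # N = 8, m = 0.5
print(cw_energy(spins, J=1.0, h=0.0))  # -> -1.0
```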
Rigorous solution: saddle point; interpolation; cavity; Hamilton-Jacobi; …
Phase transition, criticality, spontaneous symmetry breaking
Order parameter: m_N({σ}) = (1/N) ∑_{i=1}^{N} σi

Mean field: MF models are models that lack any (finite-dimensional) geometrical structure. For instance, models on the complete graph or on standard random graphs are typically mean field; models on (finite portions of) finite-dimensional grids are not.
Thermodynamic average:

G({σ}) → ⟨G⟩ = ∑_{σ} G({σ}) e^{-βH({σ}|J,h)} / Z(β)

Intensive free energy, or “mathematical pressure”:

α_N = -β f_N = (1/N) log Z_N,    α = lim_{N→∞} α_N
Towards the rigorous solution…

1) Existence of the TDL: log Z_N = N α_N is subadditive.
Take N = N1 + N2, where subsystems 1 and 2 are non-interacting, and set

m1 = (1/N1) ∑_{i=1}^{N1} σi,    m2 = (1/N2) ∑_{i=N1+1}^{N} σi

⇒ m = (N1/N) m1 + (N2/N) m2  ⇒  N m² ≤ N1 m1² + N2 m2²    (convexity of x²)

⇒ Z_N(β) ≤ Z_{N1}(β) Z_{N2}(β)

⇒ α(β) = lim_{N→∞} (1/N) log Z_N(β) = inf_N (1/N) log Z_N(β)    (Fekete’s lemma)

2) Evaluate the limit: upper and lower bounds for α coincide (Guerra & Toninelli’s scheme).

3) Find an explicit expression for the order parameter(s).
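The subadditivity step can be verified exactly for small systems by brute-force enumeration of the partition function. A numerical sketch with h = 0, J = 1, β = 1:

```python
# Exact enumeration check of subadditivity of the CW partition function:
# Z_N <= Z_{N1} * Z_{N2} with N = N1 + N2 (here h = 0).  A sketch of the
# step used with Fekete's lemma.
from itertools import product
from math import exp, log

def Z(N, beta=1.0, J=1.0):
    # H = -(J*N/2) * m^2, so the Boltzmann weight is exp(beta*J*N*m^2/2)
    total = 0.0
    for spins in product((-1, +1), repeat=N):
        m = sum(spins) / N
        total += exp(beta * J * N / 2 * m**2)
    return total

N1, N2 = 3, 4
assert Z(N1 + N2) <= Z(N1) * Z(N2)              # subadditivity of log Z_N
print([log(Z(N)) / N for N in (2, 4, 8, 12)])   # alpha_N approaches its infimum
```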
Self-consistency equation: ⟨m⟩ = tanh(βJ⟨m⟩ + βh)
in the presence of a field h
For h = 0: as βJ > 1 (i.e., T/J < 1), three intersections emerge; the equilibrium solution is the one which minimizes the free energy
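The self-consistency equation has no closed-form solution for m, but fixed-point iteration converges quickly away from criticality. A sketch (K = 1, so β = 1/T; the starting point and iteration count are arbitrary):

```python
# Fixed-point iteration for the CW self-consistency equation
#   m = tanh(beta * (J*m + h))
# A sketch; no damping is needed since the tanh iteration converges here.
from math import tanh

def solve_m(beta, J=1.0, h=0.0, m0=0.5, iters=500):
    m = m0
    for _ in range(iters):
        m = tanh(beta * (J * m + h))
    return m

print(solve_m(beta=0.5))   # high temperature (beta*J < 1): m ~ 0
print(solve_m(beta=2.0))   # low temperature (beta*J > 1): m > 0
```

Starting from m0 < 0 instead would converge to the mirror solution -m, reflecting the spontaneous symmetry breaking at h = 0.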
26
Critical exponents (reduced temperature t ≡ (T - Tc)/Tc):

⟨m⟩ ∝ (-t)^β for T < Tc,  ⟨m⟩ = 0 for T > Tc,   β_MFT = 1/2
⟨m⟩ ∝ h^{1/δ} at T = Tc,                         δ_MFT = 3
χ = ∂⟨m⟩/∂h |_{h=0} ∝ |t|^{-γ},                  γ_MFT = 1
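The exponent β_MFT = 1/2 can be checked numerically: solving the self-consistency equation slightly below Tc (= J, with K = 1), the ratio ⟨m⟩/√(-t) should approach √3, the amplitude given by the small-m expansion m ≈ βm - (βm)³/3. A sketch:

```python
# Numerical check of the mean-field exponent beta = 1/2:
# near Tc the CW magnetization behaves as m ~ sqrt(3) * (-t)^(1/2),
# t = (T - Tc)/Tc.  A sketch with J = 1, h = 0, so Tc = 1.
from math import tanh, sqrt

def magnetization(T, iters=20000):
    # iterate m = tanh(m/T); convergence slows near Tc, hence many iterations
    m = 0.5
    for _ in range(iters):
        m = tanh(m / T)
    return m

for T in (0.99, 0.999, 0.9999):
    t = T - 1.0                             # reduced temperature, Tc = 1
    print(T, magnetization(T) / sqrt(-t))   # ratios approach sqrt(3) ~ 1.73
```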
Landau theory

α(β, J, h) = (1/N) log Z(β, J, h) = log 2 - (βJ/2)⟨m⟩² + log cosh[β(J⟨m⟩ + h)]

Expanding for m ≈ 0, log cosh[β(Jm + h)] ≈ β²Jh m + (βJ)² m²/2 + (βh)²/2 + …, so that

α(β, J, h) = a + h m + b m² + c m⁴ + …    as m ≈ 0

α is an even function of m in the absence of h: quadratic away from Tc, quartic as T → Tc ⟷ criticality.
Fluctuation theory

Notice that the magnetization is self-averaging in the TDL and converges in distribution to a Dirac delta: m_N → δ(m*), where m* is the self-consistent solution and m_c = m*(h_c, J_c, T_c). At the critical point fluctuations are not Gaussian: the central limit theorem breaks down, and fluctuations are not of order √N but larger.

The study of fluctuations should be regarded as a finite-size correction to the evaluation of the equilibrium magnetization. We are then interested in the asymptotic distribution of the random variable y_N = (∑i σi/N - m*) N^γ, with γ > 0 chosen in such a way that y = lim_{N→∞} y_N has a nontrivial probability distribution (that is, N^{-γ} is the right scale to look at). See Ellis & Newman ’78, Ellis ’88.

(M_N - N m*)/√N  →D  (2πσ²)^{-1/2} e^{-x²/(2σ²)}    as T ≠ Tc
(M_N - N m_c)/N^{3/4}  →D  A e^{-B x⁴/24}           as T = Tc
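Both scalings can be checked at finite N from the exact distribution of M_N = ∑i σi, whose weights are a binomial count times the Boltzmann factor. A numerical sketch in log space to avoid overflow (h = 0, J = 1, K = 1):

```python
# Exact finite-N check of fluctuation scaling in the CW model (h = 0):
# away from Tc, Var(M_N) grows like N; at T = Tc it grows like N^(3/2).
# Weights: P(M) ~ C(N, k) * exp(beta * M^2 / 2N), with M = 2k - N.
from math import lgamma, exp

def second_moment(N, beta):
    logws = []
    for k in range(N + 1):
        M = 2 * k - N
        logws.append(lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
                     + beta * M * M / (2 * N))
    mx = max(logws)                      # normalize in log space (overflow-safe)
    Z = sum(exp(lw - mx) for lw in logws)
    m2 = sum(exp(lw - mx) * (2 * k - N) ** 2 for k, lw in enumerate(logws))
    return m2 / Z                        # <M^2>; <M> = 0 by symmetry at h = 0

for N in (100, 400, 1600):
    print(N, second_moment(N, 0.5) / N,          # T > Tc: tends to 1/(1 - beta*J) = 2
             second_moment(N, 1.0) / N ** 1.5)   # T = Tc: tends to a nonzero constant
```

The second column stabilizes (Gaussian, O(√N) fluctuations of M_N); the third stabilizes at a nonzero value only because the critical fluctuations are of order N^{3/4}.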
Let us assume h = 0 and initialize the system such that m > 0.

The system spontaneously recovers the information (+1, +1, …, +1): the CW model stores 1 bit of information. By Shannon’s compression theorem this vector contains the minimum information given N neurons.

Patterns whose entries are extracted randomly, e.g., (+1, -1, …, +1), are more challenging pieces of information.

ξ = pattern of information of length N; entries ξi are extracted according to a given P(ξi).

Transform σi → σi’ = σi ξi. The new order parameter is the Mattis magnetization.

At T = 0, the equilibrium configuration is σi = ξi (or its gauged version), ∀i. Still, this is a modest improvement…
m1 = (1/N) ∑_{i=1}^{N} ξi σi    (overlap between the spin configuration and the pattern ξ)

H_Mattis = -(1/2N) ∑_{i,j=1}^{N} ξi ξj σi σj = -(N/2) m1²

⟨m1⟩ = E_ξ [ ∑_{σ} m1 e^{-β H_Mattis} / ∑_{σ} e^{-β H_Mattis} ]

⟨H_Mattis⟩ = -(N/2) ⟨m1²⟩
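As a sanity check, zero-temperature dynamics with Mattis couplings Jij = ξi ξj recover the stored pattern from a corrupted initial configuration. A sketch with one random pattern (N and the noise level are arbitrary choices):

```python
# Zero-temperature recall in the Mattis model: store one pattern xi via
# couplings J_ij = xi_i * xi_j, then update s_i <- sign(h_i) with local
# field h_i = (1/N) sum_j J_ij s_j.  Since h_i = xi_i * m1, a noisy start
# with positive overlap is driven back to the pattern.  A sketch.
import random

random.seed(1)
N = 100
xi = [random.choice((-1, 1)) for _ in range(N)]

s = xi[:]
for i in random.sample(range(N), 15):   # corrupt 15% of the entries
    s[i] = -s[i]

for _ in range(5):                      # a few asynchronous sweeps suffice
    for i in range(N):
        field = sum(xi[i] * xi[j] * s[j] for j in range(N)) / N
        s[i] = 1 if field >= 0 else -1

overlap = sum(a * b for a, b in zip(s, xi)) / N   # Mattis magnetization m1
print(overlap)   # -> 1.0 (perfect retrieval)
```

Starting with overlap below zero would instead converge to the gauged configuration -ξ, consistent with the spin-flip symmetry at h = 0.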
CW as a trivial associative network → Mattis model