Mathematics for Artificial Intelligence
Reading Course
Elena Agliari Dipartimento di Matematica Sapienza Università di Roma
(TENTATIVE) PLAN OF THE COURSE

Introduction
Chapter 1: Basics of statistical mechanics. The Curie-Weiss model
Chapter 2: Neural networks for associative memory and pattern recognition
Chapter 3: The Hopfield model
  - Hopfield model with low load and solution via log-constrained entropy
  - Hopfield model with high load and solution via stochastic stability
Chapter 4: Beyond the Hebbian paradigm
Chapter 5: A gentle introduction to machine learning
  - Maximum likelihood
  - Rosenblatt and Minsky & Papert perceptrons
Chapter 6: Neural networks for statistical learning and feature discovery
  - Supervised Boltzmann machines
  - Bayesian equivalence between Hopfield retrieval and Boltzmann learning
Chapter 7: A few remarks on unsupervised learning, “complex” patterns, deep learning
  - Unsupervised Boltzmann machines
  - Non-Gaussian priors
  - Multilayered Boltzmann machines and deep learning

Seminars: Numerical tools for machine learning; non-mean-field neural networks; (bio-)logic gates; maximum entropy approach; Hamilton-Jacobi techniques for mean-field models; …
Introduction
20 October ‘17
The “founding fathers”
Ivan Pavlov (1849-1936)
Santiago Ramón y Cajal (1852-1934)
Donald Hebb (1904-1985)
Alan L. Hodgkin (1914-1998)
Andrew Huxley (1917-2012)
Marvin L. Minsky (1927-2016)
Seymour Papert (1928-2016)
Frank Rosenblatt (1928-1971)
Pavlov’s experiment: storing correlations between stimuli
The Hodgkin–Huxley model, or conductance-based model, is a mathematical model that describes how action potentials in neurons are initiated and propagated.
Rosenblatt’s perceptron: probabilistic model for information storage and organization in the brain
Minsky & Papert showed that a single-layer net can learn only linearly separable problems
Geoffrey Hinton
John Hopfield
Giorgio Parisi
Statistical-mechanics approach
nonlinear, over-percolated, with feedback
- cells (‘neurons’) in brains or in other nervous tissue
- electronic processors (or even software) in artificial systems inspired by biological neural networks
Principles behind information processing in complex networks of simple interacting decision-making units
Artificial intelligence is intelligence exhibited by machines
It is modeled on the mechanisms of human cognition
Learn / Retrieve

Cognition = Learning + Retrieval
Study the two mechanisms separately, under the adiabatic hypothesis: τR ~ O(ms) is much shorter than τL ~ O(days/weeks)
Disciplines are in fact specialized in either learning or retrieval:
- Engineers: machine learning ➛ speech and object recognition, robot locomotion, computer vision
- Mathematicians and theoretical physicists: rigorous results concerning the retrieval of a machine’s stored data, under different conditions
References:
- D.J. Amit (1992), Modeling Brain Function, Cambridge University Press.
- H.C. Tuckwell (2005), Introduction to Theoretical Neurobiology, Cambridge University Press.
- F. Rosenblatt (1958), “The perceptron: a probabilistic model for information storage and organization in the brain”, Psychological Review.
- A.C.C. Coolen, R. Kühn, P. Sollich (2005), Theory of Neural Information Processing Systems, Oxford University Press.
Retrieval
The network is asked to recognize previously learned input vectors, even in the case where some noise has been added.
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart): challenge-response test used in computing to determine whether or not the user is human.
Examples
1 "To be or not to be, that is _____."
2 "I came, I saw, _____."

Most readers will realize the missing information is in fact:

1 "To be or not to be, that is the question."
2 "I came, I saw, I conquered."
Associative memory
11
/150
- Content-addressable memory
In the von Neumann model of computation, an entry in memory is accessed only through its address, which is independent of the content in the memory. If a small error is made in calculating the address, a completely different item can be retrieved.
Associative memory, or content-addressable memory, is accessed by content: an item can be recalled even from a partial or distorted input. Associative memory is extremely desirable in building multimedia information databases.
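As an illustration, content-addressable recall can be reduced to nearest-neighbour matching in Hamming distance: the probe retrieves whichever stored item it most resembles, rather than an item at a fixed address. A minimal sketch (function names and patterns are invented for the example):

```python
# Content-addressable recall by nearest-neighbour (Hamming) matching.
# All names and patterns here are illustrative, not from the lecture notes.

def hamming(a, b):
    """Number of positions where two +/-1 tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def recall(memory, probe):
    """Return the stored pattern closest to the probe."""
    return min(memory, key=lambda stored: hamming(stored, probe))

memory = [(1, 1, 1, 1, -1, -1), (-1, 1, -1, 1, -1, 1)]
noisy = (1, 1, 1, -1, -1, -1)    # first pattern with one flipped entry
print(recall(memory, noisy))     # -> (1, 1, 1, 1, -1, -1)
```

Note that the probe is never an "address": retrieval succeeds despite the flipped entry because the intact entries still single out the nearest stored pattern.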
Learning
A machine learning algorithm is an algorithm that is able to learn from data
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E
To learn complex tasks we need “dynamic algorithms”
Examples of tasks
- Classification (possibly with missing inputs): the program is asked to specify which of k categories some inputs belong to
the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command
- Transcription: observe a relatively unstructured representation of some kind of data and transcribe it into discrete, textual form
a photograph containing an image of text is translated into a sequence of characters (e.g., in ASCII or Unicode format).
Google Street View processes address numbers in this way
- Anomaly detection: the program sifts through a set of events or objects, and flags some of them as being unusual or atypical
credit card companies can prevent fraud by placing a hold on an account as soon as the card has been used for an uncharacteristic purchase (a thief’s purchases often come from a different probability distribution)
- Synthesis and sampling: the program is asked to generate new examples that are similar to those in the training data
video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel
- Prediction/Forecasting: given a set of n samples in a time series, the task is to predict the sample at some future time.
impact on decision-making in business, science and engineering. Stock market prediction and weather forecasting are typical applications
- Function approximation: the task is to find an estimate of an unknown function, given a set of input-output examples.
Inference is required whenever a model is at work
- Optimization: find a solution satisfying a set of constraints such that an objective function is minimized or maximized.
the Traveling Salesman Problem, which is an NP-complete problem
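To make the optimization task concrete, here is a brute-force sketch of a tiny TSP instance. The city names and distances are invented for illustration; real instances need heuristics, since exhaustive search scales factorially.

```python
# Brute-force TSP on a toy instance: minimize the length of a closed tour
# visiting every city exactly once.  Distances are made up for illustration.
from itertools import permutations

dist = {('A', 'B'): 2, ('A', 'C'): 9, ('A', 'D'): 10,
        ('B', 'C'): 6, ('B', 'D'): 4, ('C', 'D'): 3}

def d(u, v):
    # symmetric lookup: the table stores each pair once
    return dist[(u, v)] if (u, v) in dist else dist[(v, u)]

def tour_length(order):
    # closed tour: return to the starting city at the end
    return sum(d(order[i], order[(i + 1) % len(order)]) for i in range(len(order)))

cities = ['A', 'B', 'C', 'D']
best = min(permutations(cities), key=tour_length)
print(best, tour_length(best))   # shortest closed tour has length 18
```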
Chapter I Basics of equilibrium statistical mechanics
Thermodynamic equilibrium states
Description levels: microscopic → mesoscopic → macroscopic
(microscopic → mesoscopic via coarse graining; mesoscopic → macroscopic via averaging over the proper degrees of freedom)
Statistical ensemble: the thermodynamic state of a system is described through a probability measure over the mesoscopic states of the system; the corresponding thermodynamic observables are expressed through proper averages.

Let T be the space of mesoscopic states (within the phase space Γ). For each state i ∈ T consider the energy Ei which, in general, depends on external parameters.

Thermodynamic states (not necessarily of equilibrium) are described by probability distributions ρ = {ρi}i∈T over T. Denoting by S the set of thermodynamic states,

ρ ∈ S ⇒ ρ = {ρi}i∈T : ρi ≥ 0, ∑i ρi = 1.

Defs.
Internal energy:  U(E, ρ) = ∑i ρi Ei
Entropy:          S(ρ) = -K ∑i ρi log ρi
Free energy:      F(E, ρ, T) = U(E, ρ) - T S(ρ)    (strictly convex in S)

Equilibrium state, or Boltzmann distribution: the state where F reaches its minimum,

ρ̄i = (1/Z) e^{-Ei/KT},    Z = ∑i e^{-Ei/KT} = e^{-F̄(E,T)/KT}    (partition function)

F̄(E, T) = inf_ρ F(E, ρ, T) ≡ F(E, ρ̄, T)    (Helmholtz free energy)

The parameter T can be seen as the inverse of a Lagrange multiplier: MaxEnt with U = Ū.
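The variational characterization above (the Boltzmann distribution minimizes F = U - TS, with minimum value -T log Z) can be checked numerically. A sketch with K = 1 and arbitrary, invented energy levels:

```python
# Numerical check (K = 1) that the Boltzmann distribution minimizes
# F(rho) = U(rho) - T*S(rho) over probability vectors rho, with
# minimum value -T*log(Z).  Energy levels are invented for the example.
import math, random

E = [0.0, 1.0, 2.5, 3.0]   # arbitrary energy levels
T = 1.3

def free_energy(rho):
    U = sum(r * e for r, e in zip(rho, E))
    S = -sum(r * math.log(r) for r in rho if r > 0)
    return U - T * S

Z = sum(math.exp(-e / T) for e in E)
boltzmann = [math.exp(-e / T) / Z for e in E]
F_eq = free_energy(boltzmann)

# the minimum equals -T log Z ...
assert abs(F_eq + T * math.log(Z)) < 1e-12

# ... and no random distribution does better
random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in E]
    rho = [x / sum(w) for x in w]
    assert free_energy(rho) >= F_eq - 1e-12

print("F_min =", F_eq)
```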
Theorem. Hp: the dynamic evolution of the system is described by an ergodic, stationary Markov process. Then the second principle of thermodynamics (entropy is non-decreasing) holds iff the Boltzmann state is the invariant state of the ergodic process.

The free energy F = -T log Z = U - TS bridges micro and macro: its derivatives provide the averages of thermodynamic observables.
The Curie-Weiss model
The model appeared for the first time in the physics literature as a model for ferromagnets (a ferromagnet is a material that acquires a macroscopic spontaneous magnetization at low temperature). In this context, the variables σi are called spins and their value represents the direction in which a localized magnetic moment (think of a tiny compass needle) is pointing. In some materials different magnetic moments like to point in the same direction (as people like to have similar opinions). Physicists want to understand whether this interaction might lead to a macroscopic magnetization (imbalance), or not.
Unordered state: no magnetization. Ordered state: non-null magnetization.
H({σ}, J, h) = -(1/N) ∑_{(i,j): i∼j} Jij σi σj - ∑_{i∈V} hi σi

on a graph G = (V, E), with σi = ±1, i = 1, …, |V| ≡ N; J is the coupling matrix and h the external magnetic field.

Let us check it makes sense…
- If Jij > 0 (Jij < 0), the configuration where i and j are aligned (misaligned) is energetically more favorable.
- The i-th spin tends to get aligned with hi.
- H is linearly extensive in the system volume.
H({σ}, J, h) = -(1/N) ∑_{(i,j): i∼j} Jij σi σj - ∑_{i∈V} hi σi

on a graph G = (V, E), with σi = ±1, i = 1, …, |V| ≡ N, coupling matrix J and external magnetic field h.

Hp: homogeneity

H({σ}, J, h) = -(J/N) ∑_{(i,j): i∼j} σi σj - h ∑_{i∈V} σi

Hp: mean field

H({σ}, J, h) = -(J/N) ∑_{i=1}^{N} ∑_{j<i} σi σj - h ∑_{i=1}^{N} σi
             = -(J/2N) ∑_{i,j=1}^{N} σi σj - h ∑_{i=1}^{N} σi + O(1)
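Under the homogeneity and mean-field hypotheses the Hamiltonian depends on the configuration only through the magnetization m, which makes it cheap to evaluate. A sketch (the values of J, h and the spin configuration are illustrative):

```python
# Mean-field (Curie-Weiss) Hamiltonian for N spins:
#   H = -(J/2N) * sum_{i,j} s_i s_j - h * sum_i s_i
#     = -(J*N/2) * m^2 - h*N*m,   with m = (1/N) sum_i s_i.
# A sketch; spins take values +/-1.

def cw_energy(spins, J=1.0, h=0.0):
    N = len(spins)
    m = sum(spins) / N          # magnetization: the only thing H depends on
    return -(J * N / 2) * m**2 - h * N * m

spins = [+1] * 6 + [-1] * 2            # N = 8, m = 0.5
print(cw_energy(spins, J=1.0, h=0.0))  # -> -1.0
```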
Rigorous solution: saddle point; interpolation; cavity; Hamilton-Jacobi; …
Phase transition, criticality, spontaneous symmetry breaking
Order parameter: m_N({σ}) = (1/N) ∑_{i=1}^{N} σi

Mean field: MF models are models that lack any (finite-dimensional) geometrical structure. For instance, models on the complete graph or on standard random graphs are typically mean field; models on (finite portions of) finite-dimensional grids are not.
Thermodynamic average:

G({σ}) → ⟨G⟩ = ∑_{σ} G({σ}) e^{-βH({σ}|J,h)} / Z(β)

Intensive free energy, or “mathematical pressure”:

α_N = -β f_N = (1/N) log Z_N,    α = lim_{N→∞} α_N
Towards the rigorous solution…

1) Existence of the TDL: log Z_N = N α_N is subadditive.
Take N = N1 + N2, where subsystems 1 and 2 are non-interacting, and set

m1 = (1/N1) ∑_{i=1}^{N1} σi,    m2 = (1/N2) ∑_{i=N1+1}^{N} σi

⇒ m = (N1/N) m1 + (N2/N) m2  ⇒  N m² ≤ N1 m1² + N2 m2²    (convexity of x²)

⇒ Z_N(β) ≤ Z_{N1}(β) Z_{N2}(β)

⇒ α(β) = lim_{N→∞} (1/N) log Z_N(β) = inf_N (1/N) log Z_N(β)    (Fekete’s lemma)

2) Evaluate the limit: upper and lower bounds for α coincide (Guerra & Toninelli’s scheme).

3) Find an explicit expression for the order parameter(s).
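The subadditivity step can be verified exactly for small systems by brute-force enumeration of the partition function. A numerical sketch with h = 0, J = 1, β = 1:

```python
# Exact enumeration check of subadditivity of the CW partition function:
# Z_N <= Z_{N1} * Z_{N2} with N = N1 + N2 (here h = 0).  A sketch of the
# step used with Fekete's lemma.
from itertools import product
from math import exp, log

def Z(N, beta=1.0, J=1.0):
    # H = -(J*N/2) * m^2, so the Boltzmann weight is exp(beta*J*N*m^2/2)
    total = 0.0
    for spins in product((-1, +1), repeat=N):
        m = sum(spins) / N
        total += exp(beta * J * N / 2 * m**2)
    return total

N1, N2 = 3, 4
assert Z(N1 + N2) <= Z(N1) * Z(N2)              # subadditivity of log Z_N
print([log(Z(N)) / N for N in (2, 4, 8, 12)])   # alpha_N approaches its infimum
```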
Self-consistency equation: ⟨m⟩ = tanh(βJ⟨m⟩ + βh)
in the presence of a field h
For h = 0: as βJ > 1 (i.e., T/J < 1), three intersections emerge; the equilibrium solution is the one which minimizes the free energy
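The self-consistency equation has no closed-form solution for m, but fixed-point iteration converges quickly away from criticality. A sketch (K = 1, so β = 1/T; the starting point and iteration count are arbitrary):

```python
# Fixed-point iteration for the CW self-consistency equation
#   m = tanh(beta * (J*m + h))
# A sketch; no damping is needed since the tanh iteration converges here.
from math import tanh

def solve_m(beta, J=1.0, h=0.0, m0=0.5, iters=500):
    m = m0
    for _ in range(iters):
        m = tanh(beta * (J * m + h))
    return m

print(solve_m(beta=0.5))   # high temperature (beta*J < 1): m ~ 0
print(solve_m(beta=2.0))   # low temperature (beta*J > 1): m > 0
```

Starting from m0 < 0 instead would converge to the mirror solution -m, reflecting the spontaneous symmetry breaking at h = 0.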
26
Critical exponents (reduced temperature t ≡ (T - Tc)/Tc):

⟨m⟩ ∝ (-t)^β for T < Tc,  ⟨m⟩ = 0 for T > Tc,   β_MFT = 1/2
⟨m⟩ ∝ h^{1/δ} at T = Tc,                         δ_MFT = 3
χ = ∂⟨m⟩/∂h |_{h=0} ∝ |t|^{-γ},                  γ_MFT = 1
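The exponent β_MFT = 1/2 can be checked numerically: solving the self-consistency equation slightly below Tc (= J, with K = 1), the ratio ⟨m⟩/√(-t) should approach √3, the amplitude given by the small-m expansion m ≈ βm - (βm)³/3. A sketch:

```python
# Numerical check of the mean-field exponent beta = 1/2:
# near Tc the CW magnetization behaves as m ~ sqrt(3) * (-t)^(1/2),
# t = (T - Tc)/Tc.  A sketch with J = 1, h = 0, so Tc = 1.
from math import tanh, sqrt

def magnetization(T, iters=20000):
    # iterate m = tanh(m/T); convergence slows near Tc, hence many iterations
    m = 0.5
    for _ in range(iters):
        m = tanh(m / T)
    return m

for T in (0.99, 0.999, 0.9999):
    t = T - 1.0                             # reduced temperature, Tc = 1
    print(T, magnetization(T) / sqrt(-t))   # ratios approach sqrt(3) ~ 1.73
```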
Landau theory

α(β, J, h) = (1/N) log Z(β, J, h) = log 2 - (βJ/2)⟨m⟩² + log cosh[β(J⟨m⟩ + h)]

Expanding for m ≈ 0, log cosh[β(Jm + h)] ≈ β²Jh m + (βJ)² m²/2 + (βh)²/2 + …, so that

α(β, J, h) = a + h m + b m² + c m⁴ + …    as m ≈ 0

α is an even function of m in the absence of h: quadratic away from Tc, quartic as T → Tc ⟷ criticality.
Fluctuation theory

Notice that the magnetization is self-averaging in the TDL and converges in distribution to a Dirac delta: m_N → δ(m*), where m* is the self-consistent solution and m_c = m*(h_c, J_c, T_c). At the critical point fluctuations are not Gaussian: the central limit theorem breaks down, and fluctuations are not of order √N but larger.

The study of fluctuations should be regarded as a finite-size correction to the evaluation of the equilibrium magnetization. We are then interested in the asymptotic distribution of the random variable y_N = (∑i σi/N - m*) N^γ, with γ > 0 chosen in such a way that y = lim_{N→∞} y_N has a nontrivial probability distribution (that is, N^{-γ} is the right scale to look at). See Ellis & Newman ’78, Ellis ’88.

(M_N - N m*)/√N  →D  (2πσ²)^{-1/2} e^{-x²/(2σ²)}    as T ≠ Tc
(M_N - N m_c)/N^{3/4}  →D  A e^{-B x⁴/24}           as T = Tc
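Both scalings can be checked at finite N from the exact distribution of M_N = ∑i σi, whose weights are a binomial count times the Boltzmann factor. A numerical sketch in log space to avoid overflow (h = 0, J = 1, K = 1):

```python
# Exact finite-N check of fluctuation scaling in the CW model (h = 0):
# away from Tc, Var(M_N) grows like N; at T = Tc it grows like N^(3/2).
# Weights: P(M) ~ C(N, k) * exp(beta * M^2 / 2N), with M = 2k - N.
from math import lgamma, exp

def second_moment(N, beta):
    logws = []
    for k in range(N + 1):
        M = 2 * k - N
        logws.append(lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
                     + beta * M * M / (2 * N))
    mx = max(logws)                      # normalize in log space (overflow-safe)
    Z = sum(exp(lw - mx) for lw in logws)
    m2 = sum(exp(lw - mx) * (2 * k - N) ** 2 for k, lw in enumerate(logws))
    return m2 / Z                        # <M^2>; <M> = 0 by symmetry at h = 0

for N in (100, 400, 1600):
    print(N, second_moment(N, 0.5) / N,          # T > Tc: tends to 1/(1 - beta*J) = 2
             second_moment(N, 1.0) / N ** 1.5)   # T = Tc: tends to a nonzero constant
```

The second column stabilizes (Gaussian, O(√N) fluctuations of M_N); the third stabilizes at a nonzero value only because the critical fluctuations are of order N^{3/4}.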
Let us assume h = 0 and initialize the system such that m > 0.

The system spontaneously recovers the information (+1, +1, …, +1): the CW model stores 1 bit of information. By Shannon’s compression theorem this vector contains the minimum information given N neurons.

Patterns whose entries are extracted randomly, e.g., (+1, -1, …, +1), are more challenging pieces of information.

ξ = pattern of information of length N; entries ξi are extracted according to a given P(ξi).

Transform σi → σi’ = σi ξi. The new order parameter is the Mattis magnetization.

At T = 0, the equilibrium configuration is σi = ξi (or its gauged version), ∀i. Still, this is a modest improvement…
m1 = (1/N) ∑_{i=1}^{N} ξi σi    (overlap between the spin configuration and the pattern ξ)

H_Mattis = -(1/2N) ∑_{i,j=1}^{N} ξi ξj σi σj = -(N/2) m1²

⟨m1⟩ = E_ξ [ ∑_{σ} m1 e^{-β H_Mattis} / ∑_{σ} e^{-β H_Mattis} ]

⟨H_Mattis⟩ = -(N/2) ⟨m1²⟩
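As a sanity check, zero-temperature dynamics with Mattis couplings Jij = ξi ξj recover the stored pattern from a corrupted initial configuration. A sketch with one random pattern (N and the noise level are arbitrary choices):

```python
# Zero-temperature recall in the Mattis model: store one pattern xi via
# couplings J_ij = xi_i * xi_j, then update s_i <- sign(h_i) with local
# field h_i = (1/N) sum_j J_ij s_j.  Since h_i = xi_i * m1, a noisy start
# with positive overlap is driven back to the pattern.  A sketch.
import random

random.seed(1)
N = 100
xi = [random.choice((-1, 1)) for _ in range(N)]

s = xi[:]
for i in random.sample(range(N), 15):   # corrupt 15% of the entries
    s[i] = -s[i]

for _ in range(5):                      # a few asynchronous sweeps suffice
    for i in range(N):
        field = sum(xi[i] * xi[j] * s[j] for j in range(N)) / N
        s[i] = 1 if field >= 0 else -1

overlap = sum(a * b for a, b in zip(s, xi)) / N   # Mattis magnetization m1
print(overlap)   # -> 1.0 (perfect retrieval)
```

Starting with overlap below zero would instead converge to the gauged configuration -ξ, consistent with the spin-flip symmetry at h = 0.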
CW as a trivial associative network → Mattis model