Gradient estimation for attractor networks

Thomas Flynn


Abstract

We study several types of neural networks that involve feedback connections. These models lead to interesting problems in optimization and may also be useful in applications. In all of these networks, which include deterministic and stochastic models on both continuous and discrete state spaces, gradient estimation is a dynamic process involving feedback, just like the underlying networks. First we consider optimization of deterministic attractor networks, based on the forward and adjoint sensitivity analysis methods. Then we consider derivative estimation in a stochastic variant of this network, and propose to port the forward sensitivity analysis method to this setting. Thirdly, we consider a stochastic network on a discrete space, known as the Little model. Gradient estimation in special cases of this model, such as the Boltzmann machine and sigmoid belief networks, has been studied, based on closed-form solutions for the resulting probability distributions. In the general case, one only has the Markov kernels that specify the short-term behavior of the model. To enable gradient-based optimization in this setting, we propose to calculate search directions by combining features of simultaneous perturbation analysis and measure valued differentiation. Our aim in studying these models is to expand the set of tools that are available for experimentation on problems of interest in machine learning. In addition to the theoretical results, we are also interested in applying these models to relevant problems in machine learning, such as image classification.


Contents

1 Introduction
    1.1 Notations
2 Deterministic Attractor Networks
    2.1 Background
    2.2 Research question
3 Stochastic Attractor Networks
    3.1 Research question
    3.2 Stochastic contraction
    3.3 Approach to the problem
4 Discrete models
    4.1 Background
    4.2 Stationary differentiability using MVD
    4.3 Research question
5 Conclusion and Time Line
Bibliography


1 Introduction

Many machine learning problems are approached using gradient-based optimization. At each step of the algorithm, the optimization program calculates the derivative of an objective function with respect to the parameters of the model, in order to determine what direction to move in. In many cases these derivatives cannot be calculated exactly. Difficulties in gradient estimation arise when dealing with stochastic models, and also with deterministic models that involve feedback. One approach that can be taken with models that involve feedback is to run the gradient estimation and optimization processes simultaneously. The idea is to use an iterative algorithm for computing the gradient, and carry over the state of the estimation procedure after each parameter update, to avoid starting from scratch each time.

This type of gradient estimation can be used for sensitivity analysis in attractor networks. This type of neural network, which we formally define in Section 2, is in some ways similar to the usual feed-forward networks. They can be used as input-output maps, sending, for example, an input image to a vector of confidence scores for the different categories that the image could belong to. The main difference is that an attractor network has a dynamic internal state, enabled by allowing feed-back connections among units. Feed-forward networks, on the other hand, exhibit trivial dynamical behavior; they always reach a steady state in a finite number of steps. The capacity for interesting dynamical behavior may translate into information processing capacity. However, training these networks and maintaining their stability seems difficult. The dynamical nature means that one cannot compute the derivatives of interest as easily as one can in feed-forward networks. One can construct an optimization algorithm for attractor networks using a dynamic variant of back-propagation based algorithms. We propose to address the problems of how to tune the algorithm to guarantee a function decrease, and how to obtain a long-term guarantee on the optimization algorithm. Some other works (see Figure 1) also considered these models, but the results they obtained for gradient estimation and optimization were mostly heuristic, or were asymptotic. In this setting we seek results that concern finite step sizes, and we also aim to express the results in terms of available model information.

The attractor networks can be generalized by making them stochastic. For instance, one can have stochastic connections between the nodes of the network, so that a node does not always communicate with the same set of neighbors. This could model, for instance, large distributed neural networks whose units cannot communicate perfectly due to communication delays or synchronization issues. Having stochastic connections has also been conjectured to help with overfitting. In Section 3, we propose to extend the gradient estimation algorithm from deterministic attractor networks to this setting. As shown in Figure 1, some other authors considered restricted cases of such models that have acyclic connectivity.

We then turn our attention in Section 4 to networks that operate on a discrete state space. Several stochastic neural networks on discrete state spaces have been studied, and their gradient estimation procedures are based on having closed-form solutions for the resulting probability distributions.


[Figure 1: a tree diagram classifying prior work on gradient estimation by stochastic vs. deterministic dynamics, continuous vs. discrete state space, connectivity (general, symmetric, or acyclic), and synchronous vs. asynchronous updates, with leaves labeled by the references [1]-[11] and by "this work"; see the caption below.]
Figure 1: Gradient estimation and optimization in different dynamical settings for neural networks. The first level splits into networks that are stochastic or deterministic. At the next level the split is whether the state space of the network is discrete or continuous. At the third level we differentiate based on connectivity constraints: whether the network is acyclic, has symmetric connections, or allows general connectivity. At the fourth level the split is based on the order of updating the nodes: whether they are updated all at once (synchronous) or one at a time, in an asynchronous manner. This work considers gradient estimation for three types of networks. A question mark means the author is unaware of any works considering gradient estimation in models with those properties.

The works [1, 2, 3] (see Fig. 1) depend on constraints on network connectivity, for instance symmetry, or prohibiting cycles. In the case of networks where cycles are allowed, the work [4] considered gradient estimation in the finite horizon setting. Our interest is in the long-term average cost in networks that have general connectivity, where only knowledge of the transition probabilities is available. Methods such as forward sensitivity analysis cannot be used in this case, as they rely on the differential structure of the underlying state space. Instead, we propose an algorithm that computes descent directions based on simultaneous perturbation analysis and measure valued differentiation (MVD).

Along the way we recall a number of relevant mathematical tools. In the deterministic case, to analyze convergence of the forward sensitivity system we use the contraction mapping theorem and a theorem on hierarchies of contracting systems. For considering the long-term behavior of stochastic networks, we use a condition based on contraction in the Wasserstein distance. When considering sensitivity analysis for discrete networks we will use results on MVD for Markov chains to establish differentiability of stationary costs.


1.1 Notations

• n - dimensionality of the state space of a model. In a network-based model, this will be the number of nodes in the network.

• m - dimensionality of the parameter space of a model.

• x - variable for the state of a neural network or other dynamic system.

• w_{i,j} - weight from node j into node i.

• b_i - bias at node i.

• θ - parameter vector. In a neural network model, θ is a pair (w, b), consisting of a weight matrix and bias vector.

• ∂g/∂y(y) - for a function g : R^n → R^m, this is the m×n matrix with entries [∂g/∂y(y)]_{i,j} = ∂g_i/∂y_j(y).

• t - iteration number of the algorithm.

• θ(1), θ(2), ..., θ(t), ... - sequence of parameters generated by the optimization algorithm.

• µ - a probability measure representing an external noise source.

• π_θ - probability measure depending on a parameter θ.

• P(x, A) - the probability a Markov kernel P assigns to the set A from the state x.

• When the state space is discrete, we will often abuse notation and write P(x, y) instead of P(x, {y}).

• ν(e) - expected value of a random variable e under the measure ν: ν(e) = ∫_X e(x) dν(x).

• δ_x - point mass centered at x; the measure δ_x(A) = 1 if x ∈ A, 0 otherwise.

2 Deterministic Attractor Networks

The work of [11] generated much interest in gradient-based training of feed-forward networks, and shortly afterwards there appeared works studying a wide variety of neural networks. The type of networks that we describe in this section, attractor networks, seem to have first been described in [9], [10].

Attractor networks form a class of neural networks that includes feed-forward networks, but also allows for feed-back connections. These are interesting because they can possibly support non-trivial dynamic behavior. It seems that allowing feed-back connections should increase the computational power of neural networks, perhaps allowing a network with fewer nodes and connections to replace a very big feed-forward network. We review two known algorithms for computing derivatives in these networks, called adjoint sensitivity analysis and forward sensitivity analysis. Our proposed work is to analyze an optimization algorithm based on these gradient estimators. The theoretical analysis in previous works was very limited.

2.1 Background

We now define an attractor network. The state space is X = R^n and there is a set of edges that determines the connectivity between nodes. We let x = (x_1, ..., x_n) denote a state of the neural network. The weights w ∈ R^{n×n} and biases b ∈ R^n are the parameters of the network. The number w_{i,j} is the weight into node i from node j. A value of w_{i,j} = 0 means there is no connection into node i from node j. We let Θ = R^{n×n} × R^n represent the joint parameter space.

The following function f : X × Θ × X → X gives the next state of the network given the current state, parameters, and input:

    f_i(x, (w, b), u) = σ( Σ_{j=1}^n w_{i,j} x_j + b_i + u_i ),   i = 1, 2, ..., n    (1)

That is, the state of node i at time t + 1 is determined by the input u_i at that node, and the states of its neighbors at time t. Typical choices for the function σ are the logistic function σ(x) = 1/(1 + exp(−x)) or the hyperbolic tangent σ(x) = 2/(1 + exp(−2x)) − 1.

By iterating (1) starting from an initial point x(0), one obtains a sequence of network states x(1), x(2), ..., where x(t+1) = f(x(t), θ, u). Alternatively, we may write x(t+1) = f^{t+1}(x(0), θ, u). A network has a globally attractive fixed-point when there is a state x* that is a fixed-point for f, meaning

    f(x*, θ, u) = x*,

and this fixed-point is globally attractive, meaning

    lim_{t→∞} f^t(x(0), θ, u) = x*

for any initial point x(0). We will also use the name "fixed-point networks" to refer to attractor networks. In general, this fixed-point will depend on the parameters θ = (w, b) and u, and to make this explicit we will write x*(θ, u). The question of whether a network admits a globally attractive fixed-point for a given value of the parameters w, b seems to be difficult; for one type of attractor network known as the Hopfield network, various hardness results have been obtained for this question [12]. There are several known conditions, however. For instance, contraction occurs when the weights of the network are "small" in some sense. Formally, for any norm ‖·‖ on R^n, there will be contraction when

    sup_{x∈X} ‖ ∂f/∂x(x) ‖ < 1    (2)


If we specialize to the form of f in Equation (1), we can get a more specific condition. If the norm ‖·‖ is a so-called absolute norm, which means that ‖(u_1, ..., u_n)‖ = ‖(|u_1|, ..., |u_n|)‖, then contraction will occur when

    ‖w‖ < 1 / ‖σ′‖_∞

Norms that are absolute include the norms ‖·‖_p for p ∈ {1, 2, ..., ∞}. When σ is the logistic function, ‖σ′‖_∞ = 1/4, so the condition becomes ‖w‖ < 4.
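To make the contraction condition concrete, the following sketch (our own illustration, not part of the original text; the helper names sigma and step are ours) builds a small logistic network, rescales the weight matrix so that its induced ∞-norm is below 1/‖σ′‖_∞ = 4, and iterates Equation (1) until the state settles at the globally attractive fixed-point:

    import numpy as np

    def sigma(z):
        # Logistic nonlinearity; sup |sigma'(z)| = 1/4
        return 1.0 / (1.0 + np.exp(-z))

    def step(x, w, b, u):
        # One application of Equation (1)
        return sigma(w @ x + b + u)

    rng = np.random.default_rng(0)
    n = 5
    w = rng.normal(size=(n, n))
    w *= 3.9 / np.linalg.norm(w, ord=np.inf)   # enforce ||w|| < 4, hence contraction
    b, u = rng.normal(size=n), rng.normal(size=n)

    x = np.zeros(n)
    for t in range(200):                        # iterate x(t+1) = f(x(t), theta, u)
        x_new = step(x, w, b, u)
        if np.max(np.abs(x_new - x)) < 1e-12:   # converged to x*
            break
        x = x_new
    print("residual ||f(x*) - x*||:", np.max(np.abs(step(x, w, b, u) - x)))

Without the rescaling the iteration may oscillate or converge to different points depending on the initial state, which is why a condition of this kind appears in all of the convergence arguments below.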

Let e : X → R be a loss function. The problem is to minimize the loss at the fixed-point:

    min_θ J(θ)    (3)

where J(θ) = e(x*(θ, u)).

From now on we will consider the input u as fixed and will drop it from the notation. For the class of feed-back networks that converge to a unique equilibrium, the basic methodology that is used in feed-forward networks can still be applied. For these networks the derivatives of interest can be computed using a protocol in which information is transmitted along the existing links of the network, and on backwards connections, with a dynamical variant of the backpropagation algorithm. Some early works that considered these networks were [13], [10], [14], [9], [15].

We derive the optimization algorithm that we are interested in. The exposition is somewhat similar to that in the works [16], [17]. The differentiability of J follows by the implicit function theorem and the chain rule. That is, starting from the equation

    x*(θ) = f(x*(θ), θ)    (4)

and using the contractivity and differentiability properties of f, one can conclude that x*(θ) is differentiable. Then as long as e is differentiable we can apply the chain rule to get that J is differentiable. Using this and the formulas provided by the implicit function theorem, one obtains that

    ∂J/∂θ(θ) = A(θ) B(θ) C(θ)    (5)

where

    A(θ) = ∂e/∂x(x*(θ)),    B(θ) = ( I − ∂f/∂x(x*(θ), θ) )^{−1},    C(θ) = ∂f/∂θ(x*(θ), θ)

From this formula, we can see two challenges to computing, or even approximating, the derivative. The first is that these terms involve x*(θ), which can only be approximated by iteration. Secondly, they involve the solution of linear systems (one can choose between either A(θ)B(θ) or B(θ)C(θ)). Below we describe two iterative algorithms that can address these problems.


We can calculate the term A(θ)B(θ)C(θ) by computing A(θ)B(θ) and then post-multiplying by C(θ), or we can compute B(θ)C(θ) and pre-multiply by A(θ). In each case, one can use an iterative solver to jointly solve the fixed-point equation (4) and deal with the matrix inverse. These approaches are referred to as adjoint sensitivity analysis and forward sensitivity analysis, respectively. The derivations are somewhat symmetric; we focus on the adjoint method in what follows. One early work to mention this method of sensitivity analysis, for the finite time case, was [18].

Let the space Z be R^n × R^n, and define the map T^Adj : Z × Θ → Z as

    T^Adj((x, y), θ) = ( f(x, θ),  y ∂f/∂x(x, θ) + ∂e/∂x(x) )    (6)

Assuming that T^Adj possesses a fixed-point z*(θ) = (x*(θ), y*(θ)), it is easy to verify that x*(θ) is a fixed-point for f and

    y*(θ) = A(θ) B(θ)

Therefore, if we could obtain the fixed-point z*(θ) we could easily compute the gradient (5), since

    ∂J/∂θ(θ) = G(z*(θ), θ)

where

    G((x, y), θ) = y ∂f/∂θ(x, θ)    (7)

This map T^Adj is essentially doing the same type of gradient estimation as in the back-propagation procedure for neural networks. In the case that the network has no cycles, the gradient estimation converges in a finite number of steps. For example, if f describes a feed-forward network with k layers then a fixed-point of T^Adj will be reached by iterating for k steps. If there are cycles, then under certain contraction assumptions on f, the operator T^Adj also satisfies a contraction property. This can be verified using the condition on the derivative in Inequality (2), for an appropriate choice of norm on the space Z. This is discussed in [14] and other works concerning attractor networks.

If T^Adj defines a globally attractive process on Z, this gives an iterative method to estimate the gradient: iterate T^Adj enough times starting from an arbitrary point (x(0), y(0)) to obtain a point (x(M), y(M)) close to (x*(θ), y*(θ)), and then form the estimate G((x(M), y(M)), θ). By continuity properties of f, it should be that

    G((x(M), y(M)), θ) ≈ G(z*(θ), θ)

where the quantity on the right is the true gradient. The pseudocode for this procedure, termed adjoint sensitivity analysis, is given in Algorithm 1.


Algorithm 1: Deterministic adjoint sensitivity analysis

Define T^Adj : R^n × R^n → R^n × R^n as
    T^Adj((x, y), θ) = ( f(x, θ),  y ∂f/∂x(x, θ) + ∂e/∂x(x) )
for t = 0, 1, ..., M−1 do
    (x(t+1), y(t+1)) = T^Adj((x(t), y(t)), θ)
end
Set ∆^Adj = y(M) ∂f/∂θ(x(M), θ)
return ∆^Adj

The output of Algorithm 1 is the gradient estimate ∆^Adj. It has the property that

    ∆^Adj → ∂J/∂θ(θ)  as M → ∞,

as we have explained above. By a very similar argument, one justifies the forward sensitivity analysis procedure:

Algorithm 2: Deterministic forward sensitivity analysis

Define T^Fwd : R^n × R^{n×m} × Θ → R^n × R^{n×m} as
    T^Fwd((x, u), θ) = ( f(x, θ),  ∂f/∂x(x, θ) u + ∂f/∂θ(x, θ) )
for t = 0, 1, ..., M−1 do
    (x(t+1), u(t+1)) = T^Fwd((x(t), u(t)), θ)
end
Set ∆^Fwd = ∂e/∂x(x(M)) u(M)
return ∆^Fwd

Just like adjoint sensitivity analysis, the output of Algorithm 2 has the property

    ∆^Fwd → ∂J/∂θ(θ)  as M → ∞.

The difference is that T^Fwd has as a fixed-point the pair (x*(θ), u*(θ)), where

    u*(θ) = B(θ) C(θ).
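The two procedures can be checked against each other on a small example. In the sketch below (ours; for compactness the parameter θ is restricted to the bias vector b, so that ∂f/∂θ = diag(σ′(a)) with a = wx + b + u, and the cost is e(x) = ½‖x‖²), both Algorithm 1 and Algorithm 2 are run to convergence and compared with a central finite difference of J:

    import numpy as np

    def sigma(z): return 1.0 / (1.0 + np.exp(-z))
    def dsigma(z):
        s = sigma(z)
        return s * (1.0 - s)

    rng = np.random.default_rng(1)
    n = 4
    w = rng.normal(size=(n, n))
    w *= 3.0 / np.linalg.norm(w, ord=np.inf)     # contraction, as in Section 2.1
    b, u = rng.normal(size=n), rng.normal(size=n)

    def J(bias, M=300):                          # objective e(x*(theta))
        x = np.zeros(n)
        for _ in range(M):
            x = sigma(w @ x + bias + u)
        return 0.5 * x @ x

    # Algorithm 1 (adjoint): iterate (x, y) under T^Adj.
    x, y = np.zeros(n), np.zeros(n)
    for _ in range(300):
        a = w @ x + b + u
        x, y = sigma(a), y @ (dsigma(a)[:, None] * w) + x   # de/dx(x) = x here
    delta_adj = y * dsigma(w @ x + b + u)        # y(M) df/db(x(M)), df/db diagonal

    # Algorithm 2 (forward): iterate (x, U) under T^Fwd; U tracks dx*/db.
    x, U = np.zeros(n), np.zeros((n, n))
    for _ in range(300):
        a = w @ x + b + u
        U = (dsigma(a)[:, None] * w) @ U + np.diag(dsigma(a))
        x = sigma(a)
    delta_fwd = x @ U                            # de/dx(x(M)) u(M)

    eps = 1e-6
    fd = np.array([(J(b + eps * np.eye(n)[i]) - J(b - eps * np.eye(n)[i])) / (2 * eps)
                   for i in range(n)])
    print(np.max(np.abs(delta_adj - delta_fwd)), np.max(np.abs(delta_adj - fd)))

Both printed errors are tiny. Note that the adjoint pass propagates a row vector of size n while the forward pass carries an n×n sensitivity matrix, which previews the dimensionality comparison made next.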

A very important point is that the map T^Adj operates in the space R^n × R^n, while the parameter θ is in R^m. It means that the computationally intensive "dynamical" part of gradient estimation happens on a state space that is of dimensionality 2n. The forward sensitivity analysis involves the state space R^n × R^{n×m}, which is of dimensionality n(m + 1). By a crude criterion that equates "small dimensionality of a state space" with "efficient algorithm", the adjoint sensitivity analysis has clear advantages. In neural networks, for instance, the costs of computing an arbitrary entry of ∂f/∂x, ∂f/∂θ or ∂e/∂x are roughly the same, and adjoint sensitivity analysis (which the back-propagation algorithm is based on) is the clear choice. In practice, the comparison might be more subtle. In any case, in this setting of deterministic contracting systems on a continuous state space, there is a choice between two alternatives. It is of great interest to know whether algorithms with these properties can be realized in other contexts, such as stochastic systems on continuous space, or systems on discrete spaces.

We turn these gradient estimation procedures into optimization algorithms by running the gradient estimation and parameter update processes in parallel. The first, presented in Algorithm 3, is based on forward sensitivity analysis:

Algorithm 3: Optimization using forward sensitivity analysis

for t = 0, 1, ... do
    (x(t+1), u(t+1)) = T^Fwd((x(t), u(t)), θ(t))
    θ(t+1) = θ(t) − ε ∂e/∂x(x(t+1)) u(t+1)
end

The second, Algorithm 4, is based on adjoint sensitivity analysis.

Algorithm 4: Optimization using adjoint sensitivity analysis

for t = 0, 1, ... do
    (x(t+1), y(t+1)) = T^Adj((x(t), y(t)), θ(t))
    θ(t+1) = θ(t) − ε y(t+1) ∂f/∂θ(x(t+1), θ(t))
end

These algorithms, which couple gradient estimation with optimization, are also well-known in design optimization, where they are used in aerospace applications [19], [20], [21]. In that case, the gradient is difficult to compute because it involves the numerical solution of a PDE, and the T^Adj or T^Fwd represents some kind of numerical solver with a contraction property.
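As a minimal illustration of Algorithm 4 (our sketch, again with θ restricted to the biases and an arbitrary quadratic cost e(x) = ½‖x − target‖²), one T^Adj step and one parameter update alternate, so the gradient estimate is carried over between updates instead of being recomputed from scratch:

    import numpy as np

    def sigma(z): return 1.0 / (1.0 + np.exp(-z))
    def dsigma(z):
        s = sigma(z)
        return s * (1.0 - s)

    rng = np.random.default_rng(2)
    n = 4
    w = rng.normal(size=(n, n))
    w *= 3.0 / np.linalg.norm(w, ord=np.inf)   # keep the feed-back part contracting
    u = rng.normal(size=n)
    b = rng.normal(size=n)                     # the parameter being optimized
    target = 0.25 * np.ones(n)
    eps = 0.05                                 # step size

    x, y = np.zeros(n), np.zeros(n)
    for t in range(2000):
        a = w @ x + b + u
        x, y = sigma(a), y @ (dsigma(a)[:, None] * w) + (x - target)  # one T^Adj step
        b = b - eps * y * dsigma(w @ x + b + u)  # theta update with diagonal df/db
    print("final loss:", 0.5 * np.sum((x - target) ** 2))

With ε small the loss decreases steadily; if ε is too large, the estimate y(t) lags the moving fixed-point and the decrease can fail, which is exactly the tuning question taken up in the next subsection.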

2.2 Research question

The issue in analyzing either of the procedures is to choose the step size, and to determine what conditions, if any, we must place on the initial state. We will focus on Algorithm 4 in the rest of this section, but the discussion applies equally to Algorithm 3. In this case the initial state is (x(0), y(0), θ(0)). There are two considerations when choosing the step size ε. The first is that, given a descent direction, only values that are sufficiently small will guarantee a function decrease. This is an important consideration in any gradient-based algorithm. The second issue is related to the dynamic nature of the gradient estimation. If we start out with a good gradient estimate and ε is small, then the parameter hardly changes, and perhaps at the next iteration only a single step of gradient estimation will suffice to get a good search direction, since by continuity (of both the objective function and the gradient estimation process) we will be starting from a good approximation.

The type of theorem we would like to prove about the algorithm is as follows:

Theorem 2.1. Under some reasonable assumptions, there is a choice of constants ε, δ, depending on available problem information, so that if z(0) = (x(0), y(0)) satisfies

    ‖z(0) − z*(θ(0))‖ < δ    (8)

then the sequence of θ(t) generated by Algorithm 4, starting from the point (x(0), y(0), θ(0)) and using the step size ε, satisfies these properties:

i. (e ∘ x*)(θ(t+1)) < (e ∘ x*)(θ(t)),

ii. ∂/∂θ (e ∘ x*)(θ(t)) → 0 as t → ∞.

The criterion (8) means that the initial gradient estimate is accurate. This can be achieved by running the gradient estimation algorithm for a number of iterations before starting the parameter adaptation process. Property (i) says that the objective function decreases at each step. Property (ii) states that the magnitude of the gradient tends to zero. Essentially, these two conditions state that the function values are decreasing and are not decreasing too slowly.

We give a few remarks on the challenges in proving Theorem 2.1. First, we will specify conditions that guarantee the optimization problem is well-defined. This involves contraction and differentiability conditions on the underlying system. Then, to show that the gradient estimation works, we should verify our claim that the procedure map T^Adj is a contraction mapping and converges to a unique fixed-point. When this estimation procedure is used inside an optimization algorithm (Algorithm 4), the exact gradients will not be used, so we will find a condition on the error in the gradient that guarantees convergence, and then find out how to tune the gradient procedure to maintain this condition during optimization.

There are a number of relevant works that attempt to analyze adjoint-based optimization procedures. The work of [22] analyzed a continuous-time version of the algorithm and obtained local convergence results using methods for singularly perturbed systems [23]. The results were local in the sense that they assume optimization begins near an attracting local minimum. They were also not constructive, meaning they did not quantify the requirements on the algorithm, such as time-scale parameters. In [24] the same authors considered the algorithm in the discrete-time setting, but did not pursue a convergence analysis.

A related set of works considers Hebbian learning in neural networks. This is a type of learning process for a neural network that is not explicitly gradient-based, instead being motivated by the neuroscientific theory of Hebbian learning. The similarity is that both involve simultaneous processes of adaptation and underlying network dynamics, and the need to consider the relative rate of the two activities. Convergence of a continuous-time Hebbian learning process was discussed in [25]. The work of [26] also considered convergence of this type of algorithm, again using the singular perturbation methods of [23].


As mentioned above, algorithms that involve joint adjoint-based derivative estimation and optimization are used in other optimization fields as well. In PDE-constrained optimization it is called the one-shot approach [27, 21]. However, the results in these works also assume that the algorithm starts near a global minimum.

There are a number of applications that could be of interest. One idea is inspired by reservoir networks [28]. We can consider a hierarchical network, the first component being a random attractor network, and the second being a regular feed-forward layer. We would keep the random weights fixed and optimize only the output weights. The above results require a uniform contraction property (that is, the rate of convergence of the system should stay constant as optimization proceeds). This is accommodated by keeping the feed-back weights fixed.

3 Stochastic Attractor Networks

The neural networks we described in the previous section are deterministic: to each input they associate one output. There are a number of situations in which one may be interested in stochastic networks. For instance, if one is dealing with a very large network that has to be spread across multiple computers, this can cause random delays or synchronization issues. We can model this with a neural network that has noisy connections, meaning the set of neighbors of each node is random at each time step. Additionally, a noisy network can represent a one-to-many mapping, which could be useful when there are multiple interpretations of a given image, for instance. In this section we describe a class of stochastic networks with feedback, and consider the corresponding optimization problem. When the network is well behaved, the process will be ergodic and the problem is to optimize the long-term behavior. An interesting question is whether the gradient estimation procedures we have described, the adjoint and forward sensitivity analysis (Algorithms 1 and 2), can be applied in this setting. We propose to study this question, and to derive a useful gradient estimator for this model.

As in the deterministic attractor networks, there are three issues that must be addressed in a comprehensive treatment of these models. Firstly, we specify how these networks determine an input/output mapping. This lets us define the objective function. Secondly, there is the issue of gradient estimation, meaning we need to find an algorithm for computing the derivative of this function. The third step is to define an optimization algorithm that combines the gradient estimation procedure with a parameter update step.

The output of the network we are interested in is the long-term average behavior. To guarantee that the network has a regular long-term behavior that is independent of its starting point, we will use a condition which is a stochastic analogue of the contraction condition we saw in the previous section.

The general form of our stochastic neural networks is as follows. Let ξ(1), ξ(2), ... be an infinite sequence of independent, identically distributed (i.i.d.) Ξ-valued random variables distributed according to a measure µ. Each ξ(t) could be, for instance, a vector of uniform random variables in [0, 1]. Then the state at time t + 1, denoted x(t+1), is obtained from the state x(t), the parameters θ, and the random input ξ(t+1) as

    x(t+1) = f(x(t), θ, ξ(t+1))    (9)

For instance, the ξ(t) could be a matrix of Bernoulli random variables that determine which connections are activated, in the random connection model we described above. We interpret ξ_{i,j} = 1 to mean the connection from j to i is activated, and ξ_{i,j} = 0 to mean the connection is disabled. In this case, f takes the form

    f_i(x, θ, ξ) = σ( Σ_{j=1}^n ξ_{i,j} w_{i,j} x_j + b_i )
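A minimal simulation of this random-connection model (our sketch; the activation probability p and the cost e(x) = mean(x) are arbitrary illustrative choices) draws a fresh Bernoulli mask at every step and estimates the long-term average cost by time-averaging after a burn-in:

    import numpy as np

    def sigma(z): return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(3)
    n, p = 6, 0.8                          # p = probability a connection is active
    w = rng.normal(size=(n, n))
    w *= 3.0 / np.linalg.norm(w, ord=np.inf)
    b = rng.normal(size=n)

    def step(x):
        # One step of recursion (9); xi is the random 0/1 connection mask.
        xi = (rng.random((n, n)) < p).astype(float)
        return sigma((xi * w) @ x + b)

    x = np.zeros(n)
    for _ in range(500):                   # burn-in toward the stationary measure
        x = step(x)
    costs = []
    for _ in range(5000):
        x = step(x)
        costs.append(x.mean())
    print("estimated stationary cost:", np.mean(costs))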

We are interested in the problem of optimizing the long-term average behavior of recursions such as (9). This reduces to the deterministic attractor problem in the case that there is no noise (f does not depend on ξ). To define the objective function, we need to put some restrictions on our process so that it does in fact have a regular long-term behavior. The condition will be that the Markov chain defined by (9) possesses a unique invariant measure π_θ, and that given any distribution on the initial state x(0), the distributions of the states x(1), x(2), ... converge to π_θ. We can formally state this as follows. Let P_θ be the Markov kernel associated to the stochastic system (9). Given that one is in state x, the quantity P_θ(x, A) is the probability that the random next state f(x, θ, ξ) is in the set A. Formally, letting µ be the noise measure,

    P_θ(x, A) = µ({ξ | f(x, θ, ξ) ∈ A})

For a function e : X → R we denote by P_θ e the function

    (P_θ e)(x) = ∫_X e(y) d(δ_x P_θ)(y) = ∫_Ξ e(f(x, θ, ξ)) dµ(ξ)

This number (P_θ e)(x) is the expectation of the value of e at the next state of the network, given that we start in state x. For a measure ν on X we denote by νP_θ the measure

    (νP_θ)(A) = ∫_X P_θ(x, A) dν(x)

Then the algorithm (9) possesses a stationary measure π_θ when

    π_θ = π_θ P_θ    (10)

Instead of using the terminology "globally attractive", we will say that the network is ergodic. This means that (10) holds, and for any initial measure ν,

    νP_θ^t → π_θ  as t → ∞.    (11)


The type of convergence in (11) is weak convergence. A sequence of measures µ_t converges weakly to a measure µ if µ_t(e) → µ(e) as t → ∞ for all bounded Lipschitz functions e : X → R.

The optimization problem we are interested in is, given a cost function e : X → R,

    min_{θ∈Θ} J(θ)

where

    J(θ) = ∫_X e(x) dπ_θ(x)

We approach this in a way that generalizes the deterministic approach described in the previous section. Our concern is with gradient estimation for this problem. The adjoint sensitivity analysis has already been described above, in the context of deterministic attractor networks. In Algorithm 5 we define essentially the same algorithm as described for deterministic networks, only now the algorithm operates in the stochastic environment:

Algorithm 5: Stochastic adjoint sensitivity analysis

Define T^Adj : R^n × R^n × Θ × Ξ → R^n × R^n as
    T^Adj((x, y), θ, ξ) = ( f(x, θ, ξ),  y ∂f/∂x(x, θ, ξ) + ∂e/∂x(x) )
for t = 0, 1, ..., M−1 do
    (x(t+1), y(t+1)) = T^Adj((x(t), y(t)), θ, ξ(t+1))
end
Set ∆^Adj = y(M) ∂f/∂θ(x(M), θ, ξ(M))
return ∆^Adj

Alternatively, the method of forward sensitivity analysis leads to the gradient estimation procedure shown in Algorithm 6.

Algorithm 6: Stochastic forward sensitivity analysis

Define T^Fwd : R^n × R^{n×m} × Θ × Ξ → R^n × R^{n×m} as
    T^Fwd((x, u), θ, ξ) = ( f(x, θ, ξ),  ∂f/∂x(x, θ, ξ) u + ∂f/∂θ(x, θ, ξ) )
for t = 0, 1, ..., M−1 do
    (x(t+1), u(t+1)) = T^Fwd((x(t), u(t)), θ, ξ(t+1))
end
Set ∆^Fwd = ∂e/∂x(x(M)) u(M)
return ∆^Fwd
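The forward procedure can be sketched for the random-connection model above (our illustration; θ is restricted to the biases so that ∂f/∂θ(x, θ, ξ) = diag(σ′(a)) with a = (ξ∘w)x + b, the cost is e(x) = ½‖x‖², and instead of keeping only the final iterate we time-average the per-step estimate after a burn-in, which is natural when the pair process (x(t), u(t)) is ergodic):

    import numpy as np

    def sigma(z): return 1.0 / (1.0 + np.exp(-z))
    def dsigma(z):
        s = sigma(z)
        return s * (1.0 - s)

    rng = np.random.default_rng(4)
    n, p = 5, 0.8
    w = rng.normal(size=(n, n))
    w *= 3.0 / np.linalg.norm(w, ord=np.inf)
    b = rng.normal(size=n)

    def T_fwd(x, U, xi):
        # One step of Algorithm 6 for f(x, b, xi) = sigma((xi * w) @ x + b).
        a = (xi * w) @ x + b
        return sigma(a), (dsigma(a)[:, None] * (xi * w)) @ U + np.diag(dsigma(a))

    x, U = np.zeros(n), np.zeros((n, n))
    grads = []
    for t in range(20000):
        xi = (rng.random((n, n)) < p).astype(float)
        x, U = T_fwd(x, U, xi)
        if t >= 1000:                        # discard burn-in
            grads.append(x @ U)              # de/dx(x(t)) u(t)
    print("estimated d/db of the stationary cost:", np.mean(grads, axis=0))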


3.1 Research question

In each case, we would like to know whether the procedures do in fact help to calculate derivatives, as stated in the following:

Theorem 3.1. Let the network function f, the error function e, and the noise measure µ satisfy some reasonable assumptions. Then the derivative ∂/∂θ π_θ(e) can be computed using Algorithm 5 or Algorithm 6. That is, one of the following holds:

i. The process (x(t+1), u(t+1)) = T^Fwd(x(t), u(t), θ, ξ(t+1)) is ergodic and

    E[∆^Fwd] → ∂J/∂θ(θ)  as M → ∞;

ii. The process (x(t+1), y(t+1)) = T^Adj(x(t), y(t), θ, ξ(t+1)) is ergodic and

    E[∆^Adj] → ∂J/∂θ(θ)  as M → ∞.

3.2 Stochastic contraction

We will define a stochastic version of contraction. This will be useful for analyzing the stochastic sensitivity analysis procedures defined in Algorithms 5 and 6. Consider a stochastic algorithm of the form

    x(t+1) = f(x(t), ξ(t+1))    (12)

where the ξ(t) are i.i.d. µ-distributed random variables. Let P be the corresponding Markov kernel. It is possible to define a metric on the probability measures in such a way that contraction of the Markov kernel P on this metric space reduces to the deterministic notion of contraction in the case that the iterations (12) are actually deterministic, that is, not depending on ξ. This is done through the Wasserstein distance. There are several equivalent definitions of the Wasserstein distance. A simple way to define it is as follows. For probability measures µ_1, µ_2 on a metric space,

    d(µ_1, µ_2) = sup_{‖e‖_Lip ≤ 1} |µ_1(e) − µ_2(e)| = sup_{‖e‖_Lip ≤ 1} | ∫_X e(x) dµ_1(x) − ∫_X e(x) dµ_2(x) |

To make sure this is well defined we need some requirements on the measures µ_1, µ_2. We define the Wasserstein space P(X) as follows:

    P(X) = { µ | ∫_X ‖x‖ dµ(x) < ∞ }

That is, P(X) consists of all the measures µ such that x ↦ ‖x‖ is µ-integrable. It can be shown that this metric space (P(X), d) is complete when X is complete [29]. That means the contraction mapping theorem can be used to test whether a Markov kernel is ergodic: if the Markov kernel P defines a contraction on this space, then µP^t converges to a unique invariant measure π for any starting measure µ. This is formalized below.

Proposition 3.2. Let P be a Markov kernel on a Polish space X. Let the following contraction condition hold:

    sup_{x_1 ≠ x_2} d(δ_{x_1}P, δ_{x_2}P) / d(x_1, x_2) =: ρ < 1    (13)

Then P has a unique stationary measure π, and

    d(µP^t, π) ≤ ρ^t d(µ, π),

and for any Lipschitz function e,

    |µP^t(e) − π(e)| ≤ ‖e‖_Lip ρ^t d(µ, π).

One can prove the following sufficient condition for contraction in terms of the derivative of f:

Proposition 3.3. Consider the system (12). Let ‖·‖ be a norm on X and suppose that

i. µP ∈ P(X) for all µ ∈ P(X),

ii. sup_x ∫_Ξ ‖ ∂f/∂x(x, ξ) ‖ dµ(ξ) < 1.

Then the corresponding Markov kernel P is a contraction on the space P(X).

Processes that satisfy the conditions of Proposition 3.3 include iterated function systems [30, 31, 32, 33].

3.3 Approach to the problem

We outline our approach to showing, under certain conditions, that the forward sensitivity analysis procedure is valid for stochastic systems. We assume that P_θ is a contraction on the state space X. We also need to assume that the function f satisfies certain differentiability requirements. The function f and its derivatives will also need to satisfy certain integrability requirements relative to the noise measure µ.

First, we establish differentiability of π_θ. Define ∂/∂θ P_θ to be the linear map sending e to ∂/∂θ (P_θ e). We show that the equation

    l = l P_θ + π_θ (∂/∂θ P_θ)    (14)

has a unique solution l*, and that this linear functional must be the derivative of π_θ, that is,

    l*(e) = ∂/∂θ π_θ(e)


Then we can show that, asymptotically, Algorithm 6 represents this linear functional. That is, assuming γ_θ is the stationary distribution of the recursion z(t+1) = T^Fwd(z(t), θ, ξ(t+1)), we can consider the linear functional

    l(e) = ∫_Z ∂e/∂x(x) u dγ_θ(x, u)

If we can show that l also satisfies the equation (14), then we are done, since the only solution is the stationary derivative.

4 Discrete models

In this section we discuss a type of stochastic network on a discrete space termed random threshold networks, also known as the Little model [7]. Gradient estimation has been studied for closely related models, such as the Boltzmann machine and sigmoid belief networks, as we discuss below. This model is somewhat more challenging, since there is no known closed-form solution for the stationary distribution. Therefore one has to focus on gradient estimation methods that only use the Markov kernel associated to the process. We review some standard methods that apply to this setting, including finite differences (FD), simultaneous perturbation analysis (SP), and measure valued differentiation (MVD). We then propose an estimator which combines features of SP and MVD. The algorithm generates a random direction, as in SP, and then uses measure valued differentiation to approximate this directional derivative.

4.1 Background

The earliest neural network models to be studied from the computational point of view were the deterministic threshold networks [34, 35]. In this model, each unit senses the states of its neighbors, takes a weighted sum of the values, and applies a threshold to determine its next state (either on or off). For single-layer versions of these networks, where the units are partitioned into input and output groups, with connections only from input to output nodes, the corresponding optimization problem can be solved by the perceptron algorithm. Any iterative algorithm for optimizing threshold networks has to address the credit assignment problem. This means that during optimization, the algorithm must identify which internal components of the network are not working correctly, and adjust those units to improve the output. The difficulty in solving the credit assignment problem for threshold networks with multiple layers prevents simple deterministic threshold models from being used in complex problems like image recognition. There have been a number of well-known approaches to the problem. For instance, one can abandon the threshold units, and work with units that have a smooth, graded response, such as the sigmoid neural networks described above. In this case methods of calculus are available to determine unit sensitivities. These new networks are still deterministic but now operate on a continuous state space.


Another approach is to keep the space discrete but make the network probabilistic, and use the smoothing effects of the noise to obtain a model to which one can apply methods of calculus. One can interpret the sigmoid belief networks in this way. These networks were introduced in [8] and so named because they combine features of sigmoid neural networks and Bayesian networks. In these networks, when a unit receives a large positive input it is very likely to turn on, while a large negative input means the unit is likely to remain off. In fact, these networks can be interpreted as threshold networks with random thresholds. The use of the sigmoid function, which is the cumulative distribution function (CDF) of the logistic distribution, leads to an interpretation of a network with thresholds drawn from the logistic distribution. In [8], the author derived formulas for the gradient in these networks, and showed how Markov chain Monte Carlo (MCMC) techniques can be used to implement gradient estimators. The networks studied in [8] had a feed-forward architecture, but one could also define variants that allow cycles among the connections. In this way one is led to the random threshold networks. In this case, one would be interested in the long-term average behavior of the network. Such a generalization would resemble the random threshold networks that are our focus. It would be interesting to obtain a gradient estimator for these new networks.

Another motivation to study general random threshold networks comes from the Boltzmann machine [1]. This is a network of stochastic units that are connected symmetrically. This means there is feed-back in the network, and the problem in these networks is to optimize the long-term behavior. The symmetry in the network, and the use of the sigmoid function to calculate the probabilities, leads to a nice closed-form solution for the stationary measure in this model. Based on formulas for the stationary distribution, expressions for the gradient of long-term costs can be obtained, leading to MCMC-based gradient estimators. If one changes the model, by for instance using non-symmetric connections, or changing the type of nonlinearity, these formulas are no longer available. Instead, one winds up with a model like the random threshold networks. This provides another motivation for studying gradient estimation in the Little model.

The networks that we are studying were first defined in [7], and are sometimes referred to as the Little model. They can be interpreted as threshold networks where the thresholds are randomly chosen at each time step. For this reason we also refer to them as random threshold networks (RTNs). We define a random threshold network as follows. Let the network have n nodes, and let ξ(1), ξ(2), ... be a sequence of noise vectors in R^n, with the entire collection {ξ_i(t); i = 1, ..., n, t = 1, 2, ...} independent and distributed according to the logistic distribution. That is, the CDF of ξ_i(t) is

    P(ξ_i(t) < x) = σ(x) = 1 / (1 + exp(−x)).

Define f^RTN : {0, 1}^n × Θ × Ξ → {0, 1}^n as

    f_i^RTN(x, (w, b), ξ) = 1 if Σ_{j=1}^n w_{i,j} x_j + b_i > ξ_i, and 0 otherwise    (15)


This function f^RTN and the noise ξ(1), ξ(2), ... determine the operation of the random threshold network; from the initial point x(0) one follows the recursion

    x(t+1) = f^RTN(x(t), θ, ξ(t+1))    (16)

to generate the next state. The transition probabilities for the RTN are as follows. We denote by P_θ(x^0, x^1) the probability of going to state x^1 ∈ {0, 1}^n from state x^0 ∈ {0, 1}^n. The function u_i(x, θ), which determines the input to node i at the state x, is defined as

    u_i(x, θ) = Σ_j w_{i,j} x_j + b_i

Then

    P_θ(x^0, x^1) = Π_{i=1}^n σ(u_i(x^0, θ))^{x^1_i} (1 − σ(u_i(x^0, θ)))^{1 − x^1_i}

Alternatively, we can use the following notation of [8]: for x ∈ {0, 1}, define

    x* = 2x − 1

Using this together with the identity 1 − σ(x) = σ(−x), we get

    P_θ(x^0, x^1) = Π_{i=1}^n σ((x^1_i)* u_i(x^0, θ))    (17)

The RTN is related to the two models we discussed above, the sigmoid belief network and the Boltzmann machine. If the connectivity graph is acyclic, then one obtains a model resembling the sigmoid belief network. We can enforce this by requiring w_{i,j} = 0 if i < j. A model like the Boltzmann machine is obtained if the weights are symmetric, meaning w_{i,j} = w_{j,i}. Technically, if one puts a symmetry requirement on our threshold networks, one does not exactly recover the Boltzmann machine, but a variant known as the synchronous or parallel Boltzmann machine [2]. The synchronous Boltzmann machine also has a known, simple stationary distribution, as shown in [2].

We will not attempt to solve for the stationary distribution; rather, we are interested in constructing an optimization procedure using only knowledge of the transition function f^RTN. Firstly, the differentiability of P_θ is easily established from properties of the sigmoid function. As we are dealing with a discrete space, the general methods available for computing the derivative include finite differences, the score-function method, and measure valued differentiation. We briefly review these now.

The finite difference estimator for the derivative of the stationary cost is shown in Algorithm 7. In the case of a single, real-valued parameter, the finite difference method uses two copies of the system to estimate the derivative at a particular point θ(0). One copy of the system runs with the setting θ(0) + λ, and one uses θ(0) − λ. After running for a long time, the error in both of these systems is sampled, and a difference quotient is formed as the gradient estimate. The extension to m parameters involves replicating the procedure m times, once for each coordinate direction. Typically, when dealing with an n-node neural network there are ≈ n² parameters. Therefore finite differences would require running ≈ 2n² copies of the network, which is infeasible. Furthermore, it has unfavorable variance properties.

Algorithm 7: Finite difference derivative estimation for Markov chains

Let v_1, v_2, ..., v_m be the coordinate basis vectors in R^m.
Define T^FD : X^m × X^m × R^m × Ξ → X^m × X^m as
    T_i^FD(x, y, θ, ξ) = ( f(x_i, θ + λv_i, ξ), f(y_i, θ − λv_i, ξ) ),   i = 1, 2, ..., m.
for t = 0, 1, ..., M−1 do
    (x(t+1), y(t+1)) = T^FD(x(t), y(t), θ, ξ(t+1))
end
Set ∆^FD = Σ_{i=1}^m [ ( e(x_i(M)) − e(y_i(M)) ) / (2λ) ] v_i
return ∆^FD

One interesting solution to the cost of estimating derivatives with finite differences is known as simultaneous perturbation [36]. In this scheme, one picks a random direction v, and then approximates the directional derivative using stochastic finite differences as in Algorithm 7. In this case only two simulations are needed for a system with m parameters. The variance issues remain with this approach; in order to decrease the bias of the estimator, one has to deal with a larger variance. For generating the directions, one possibility is to let v be a random point on the hypercube {−1/2, 1/2}^m, as suggested in [36]. For the theoretical analysis, it is important that the directions have mean 0, and that the random variable 1/‖v‖ is integrable. The procedure is shown in Algorithm 8.

Algorithm 8: Simultaneous perturbation derivative estimation for Markov chains

Sample a direction v from the measure
    P(v) = Π_{i=1}^m [ (1/2) δ_{−1/2}(v_i) + (1/2) δ_{1/2}(v_i) ]    (18)
Define T^SP : X × X × Ξ → X × X as
    T^SP(x, y, ξ) = ( f(x, θ + λv, ξ), f(y, θ − λv, ξ) )
for t = 0, 1, ..., M−1 do
    (x(t+1), y(t+1)) = T^SP(x(t), y(t), ξ(t+1))
end
Set ∆^SP = [ ( e(x(M)) − e(y(M)) ) / (2λ) ] v
return ∆^SP
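A sketch of Algorithm 8 applied to the Little model (ours; the two perturbed copies share a common noise vector at every step, and to reduce variance we time-average the difference quotient after a burn-in rather than using only the final states; in practice one would also average over several sampled directions v):

    import numpy as np

    def sigma(z): return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(6)
    n = 4
    w = rng.normal(size=(n, n))
    b = rng.normal(size=n)             # the parameters; here m = n of them
    lam, M = 0.1, 5000                 # perturbation size and horizon

    def step(x, bias, xi):
        # Little-model step driven by the common uniform noise vector xi.
        return (xi < sigma(w @ x + bias)).astype(float)

    v = rng.choice([-0.5, 0.5], size=n)    # direction sampled as in (18)
    x, y = np.zeros(n), np.zeros(n)
    deltas = []
    for t in range(M):
        xi = rng.random(n)                 # common random numbers for both copies
        x = step(x, b + lam * v, xi)
        y = step(y, b - lam * v, xi)
        if t > 500:
            deltas.append((x.mean() - y.mean()) / (2 * lam) * v)
    print("SP search direction estimate:", np.mean(deltas, axis=0))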

The idea of measure valued differentiation is to express the derivative of an expectation as the difference of two expectations. Each of these expectations involves the same cost function of interest, but the underlying measures are different. If these measures are easy to sample from, this leads to a simple, unbiased derivative estimator. Formally, we consider a family of cost functions D and a measure µ_θ that depends on a real parameter θ. For simplicity we will consider the setting of a finite state space X in these definitions. Then µ_θ is said to be D-differentiable at θ if there is a triple (c_θ, µ_θ^+, µ_θ^−), consisting of a real number c_θ and two probability measures µ_θ^+, µ_θ^− on X, such that

    ∂/∂θ µ_θ(e) = c_θ ( µ_θ^+(e) − µ_θ^−(e) )

for any function e ∈ D. An MVD gradient estimator would consist of two parts: first, sample a random variable Y^+ distributed according to µ_θ^+, then sample a random variable Y^− according to µ_θ^−, and finally form the estimate ∆^MVD = c_θ [e(Y^+) − e(Y^−)]. Compared to finite differences, the advantage is that there is no bias and there is no division by a small number. For some background see [37].

The following is a simple example.

Example 4.1. Let ν_1, ν_2 be two probability measures on a measurable space X, and define the measure

    µ_θ = e^{−θ²} ν_1 + (1 − e^{−θ²}) ν_2    (19)

that depends on a parameter θ ∈ R. The parameter determines which of the measures ν_i is more likely in this mixture. By simple calculus, it holds that for any bounded measurable function e : X → R,

    ∂/∂θ µ_θ(e) = c_θ [ν_2(e) − ν_1(e)]

where

    c_θ = 2θ e^{−θ²}

Therefore the triple (c_θ, ν_2, ν_1) is an MVD of the measure µ_θ.
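The identity in Example 4.1 is easy to check numerically. In the sketch below (ours) the measures live on a two-point space and are represented as probability vectors, and the MVD form of the derivative is compared against a central finite difference:

    import numpy as np

    nu1 = np.array([0.9, 0.1])      # two fixed measures on the space {0, 1}
    nu2 = np.array([0.2, 0.8])
    e = np.array([1.0, 3.0])        # an arbitrary cost function e : X -> R

    def mu(theta):                  # the mixture (19)
        a = np.exp(-theta ** 2)
        return a * nu1 + (1 - a) * nu2

    theta = 0.7
    c = 2 * theta * np.exp(-theta ** 2)      # c_theta from Example 4.1
    mvd = c * (nu2 @ e - nu1 @ e)            # c_theta [nu2(e) - nu1(e)]

    h = 1e-6
    fd = (mu(theta + h) @ e - mu(theta - h) @ e) / (2 * h)
    print(mvd, fd)                           # the two values agree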

The concept of MVD can be extended from measures to Markov kernels, and then applied to derivatives of stationary costs. A Markov kernel P_θ that is defined on a discrete space X and which depends on a real parameter θ is said to be D-differentiable at θ if for each x ∈ X there is a triple

    (c_θ(x), P_θ^+(x, ·), P_θ^−(x, ·))

which is the measure valued derivative of the measure P_θ(x, ·) for the cost functions D at the parameter θ. If the Markov kernel P_θ is ergodic with stationary measure π_θ, then in certain cases we can use the triples (c_θ(x), P_θ^+(x, ·), P_θ^−(x, ·)) to compute the stationary derivatives ∂/∂θ π_θ(e) [38]. We now describe this.

Let the Markov kernels P_θ^+ and P_θ^− be represented by the functions f^+, f^−, meaning

    P_θ^+(x, A) = µ({ξ | f^+(x, θ, ξ) ∈ A}),    P_θ^−(x, A) = µ({ξ | f^−(x, θ, ξ) ∈ A}).


The procedure is presented in Algorithm 9. A theorem about its correctness is given in Theorem 4.2. For measure valued differentiation, as for finite differences, it seems that m simulations are required for a system with m parameters, although the variance characteristics are much more favorable compared to finite differences (see [39], Section 4.3). In finite differences, one must trade off bias for variance, but for MVD the variance can be shown to be bounded independently of the parameters M_1, M_0, which determine the bias.

Algorithm 9: MVD gradient estimation for Markov chains

for t = 0, 1, ..., M_0 − 1 do
    x(t+1) = f(x(t), θ, ξ(t))
end
Set x^+(0) = f^+(x(M_0), θ, ξ(M_0)) and x^−(0) = f^−(x(M_0), θ, ξ(M_0))
for t = 0, 1, ..., M_1 − 1 do
    (x^+(t+1), x^−(t+1)) = ( f(x^+(t), θ, ξ(M_0 + t + 1)), f(x^−(t), θ, ξ(M_0 + t + 1)) )
end
Set ∆^MVD = c_θ(x(M_0)) Σ_{t=0}^{M_1} [ e(x^+(t)) − e(x^−(t)) ]
return ∆^MVD

4.2 Stationary differentiability using MVD

In this section we recall a theorem on measure valued differentiation for Markov chains. It gives a condition on a Markov kernel P_θ that guarantees the corresponding stationary costs π_θ(e) are differentiable. This result is from [38].

Theorem 4.2. Let (δ_x P_θ)(e) be differentiable for each bounded, Lipschitz continuous e. That is, for each x there is a triple (c_θ(x), P_θ^+(x, ·), P_θ^−(x, ·)) such that

    ∂/∂θ (δ_x P_θ)(e) = c_θ(x) [ (δ_x P_θ^+)(e) − (δ_x P_θ^−)(e) ]

for each bounded, Lipschitz e. Furthermore, suppose that P_θ is a contraction on the space P(X) in the sense of Inequality (13). Then the stationary cost π_θ(e) is differentiable, and Algorithm 9 can be used to estimate the derivatives. Specifically, if we let ∆^MVD be the output of the algorithm, then

    lim_{M_0→∞, M_1→∞} ∆^MVD = ∂/∂θ π_θ(e)

More general results on MVD for stationary measures can be found in [40].
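As a self-contained check of Theorem 4.2 and Algorithm 9, consider the following two-state chain (our construction, not from the text): from state x ∈ {0, 1} the chain moves to state 1 with probability σ(θ + ax). Each row of the kernel is then a Bernoulli measure in σ, so ∂/∂θ (δ_x P_θ)(e) = σ′(θ + ax)[e(1) − e(0)], giving the triple c_θ(x) = σ′(θ + ax), P_θ^+(x, ·) = δ_1, P_θ^−(x, ·) = δ_0. The coupled chains below share their noise, so they coalesce and the sum in ∆^MVD stabilizes; the averaged estimate is compared to a finite difference of the exactly computable stationary cost:

    import numpy as np

    def sigma(t): return 1.0 / (1.0 + np.exp(-t))

    theta, a = 0.3, -0.8
    e = np.array([0.0, 1.0])                 # cost function on {0, 1}
    rng = np.random.default_rng(8)

    def f(x, xi):
        return int(xi < sigma(theta + a * x))   # nominal kernel

    def c(x):
        s = sigma(theta + a * x)
        return s * (1 - s)                   # c_theta(x) = sigma'(theta + a x)

    def mvd_estimate(M0=200, M1=200):
        x = 0
        for _ in range(M0):                  # run to (approximate) stationarity
            x = f(x, rng.random())
        xp, xm = 1, 0                        # x+(0) ~ P+ = delta_1, x-(0) ~ P- = delta_0
        total = e[xp] - e[xm]
        for _ in range(M1):
            xi = rng.random()                # common noise couples the two chains
            xp, xm = f(xp, xi), f(xm, xi)
            total += e[xp] - e[xm]
        return c(x) * total

    est = np.mean([mvd_estimate() for _ in range(4000)])

    def stationary_cost(th):                 # exact stationary cost, for comparison
        p0, p1 = sigma(th), sigma(th + a)    # P(x -> 1) from x = 0 and x = 1
        pi1 = p0 / (1 + p0 - p1)             # solves pi = pi P
        return pi1 * e[1] + (1 - pi1) * e[0]

    h = 1e-6
    print(est, (stationary_cost(theta + h) - stationary_cost(theta - h)) / (2 * h))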

4.3 Research question

Using the above ideas, we propose a gradient estimator that works by picking a random direction, as in SP, and computing the directional derivative using measure valued differentiation. In this way one deals with a small number of simulations, as in SP, while avoiding the variance issues of finite differences. The method is termed simultaneous perturbation measure valued differentiation. The only requirement is that one can compute the measure valued derivative along arbitrary directions.

Let µ_θ be a measure depending on an m-dimensional vector parameter θ, and let v ∈ R^m be a direction. A triple (c_{θ,v}, µ_{θ,v}^+, µ_{θ,v}^−) is called a measure valued directional derivative in the direction v at θ if for all e : X → R,

    ∂/∂θ µ_θ(e) v = c_{θ,v} [ µ_{θ,v}^+(e) − µ_{θ,v}^−(e) ].

In practice, one can try to calculate the MVD in a direction v as follows. Consider the measure µ_λ = µ_{θ+λv}, which depends on the real parameter λ. Then by basic calculus,

    ∂/∂θ µ_θ(e) v = ∂/∂λ µ_λ(e) |_{λ=0}

Therefore, to find the MVD of µ_θ in the direction v it suffices to find the normal, scalar MVD of µ_λ at λ = 0. This is the approach taken in the following example.

Example 4.3. Let µ_1, µ_2, ..., µ_m be m mixture components, and for a vector parameter θ ∈ R^m define µ_θ as

    µ_θ = Σ_{i=1}^m ( e^{−θ_i} / Z_θ ) µ_i

where Z_θ = Σ_{i=1}^m e^{−θ_i}. For any function e : X → R and direction v ∈ R^m we then have

    µ_{θ+λv}(e) = Σ_{i=1}^m ( e^{−(θ_i + λv_i)} / Z_{θ+λv} ) µ_i(e)    (20)

Introduce the notation γ^+ = max{0, γ} and γ^− = −min{0, γ}, so that for any number γ the identities γ = γ^+ − γ^− and |γ| = γ^+ + γ^− hold. Also define J_θ = Σ_{j=1}^m e^{−θ_j} |v_j|. Differentiating (20) at λ = 0, and doing some algebra, one can get the following representation for the directional derivative:

    ∂/∂θ µ_θ(e) v = ∂/∂λ µ_{θ+λv}(e) |_{λ=0} = c_θ [ Σ_{i=1}^m α_i µ_i(e) − Σ_{i=1}^m β_i µ_i(e) ]

where

    c_θ = J_θ / Z_θ,

    α_i = ( e^{−θ_i} / (Z_θ J_θ) ) [ v_i^− Z_θ + Σ_{j=1}^m e^{−θ_j} v_j^+ ]

and

    β_i = ( e^{−θ_i} / (Z_θ J_θ) ) [ v_i^+ Z_θ + Σ_{j=1}^m e^{−θ_j} v_j^− ].

Therefore the triple ( c_θ, Σ_{i=1}^m α_i µ_i, Σ_{i=1}^m β_i µ_i ) is the measure valued derivative of µ_θ at the parameter θ in the direction v.
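The formulas in Example 4.3 can be verified numerically. The sketch below (ours) represents each mixture component as a probability vector on a three-point space, forms the triple (c_θ, Σ_i α_i µ_i, Σ_i β_i µ_i), and compares it against a finite difference of the directional derivative:

    import numpy as np

    rng = np.random.default_rng(7)
    m = 4
    mus = rng.random((m, 3))
    mus /= mus.sum(axis=1, keepdims=True)    # m mixture components on a 3-point space
    e = rng.normal(size=3)                   # an arbitrary cost function
    theta = rng.normal(size=m)
    v = rng.normal(size=m)                   # the direction

    def mix(t):                              # mu_theta as a probability vector
        wgt = np.exp(-t)
        return (wgt / wgt.sum()) @ mus

    Z = np.exp(-theta).sum()                 # Z_theta
    J = (np.exp(-theta) * np.abs(v)).sum()   # J_theta
    c = J / Z                                # c_theta
    vp, vm = np.maximum(v, 0), np.maximum(-v, 0)          # v+ and v-
    alpha = np.exp(-theta) / (Z * J) * (vm * Z + (np.exp(-theta) * vp).sum())
    beta  = np.exp(-theta) / (Z * J) * (vp * Z + (np.exp(-theta) * vm).sum())
    mvd = c * ((alpha @ mus) @ e - (beta @ mus) @ e)

    h = 1e-6                                 # directional finite difference
    fd = (mix(theta + h * v) @ e - mix(theta - h * v) @ e) / (2 * h)
    print(mvd, fd)                           # the two values agree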

Defining the extension to Markov chains is straightforward. A triple

    (c_{θ,v}, P_{θ,v}^+, P_{θ,v}^−)

is a measure valued derivative for the Markov kernel P_θ in the direction v if for each x, (c_{θ,v}(x), P_{θ,v}^+(x, ·), P_{θ,v}^−(x, ·)) is an MVD in the direction v for the measure P_θ(x, ·). The gradient estimator for stationary costs proceeds by choosing a random direction and applying the stationary MVD procedure of Algorithm 9. The pseudocode is presented as Algorithm 10.

Algorithm 10: Simultaneous perturbation measure valued differentiation (SPMVD)

Define the random search direction v as in SP:
    P(v) = Π_{i=1}^m [ (1/2) δ_{−1/2}(v_i) + (1/2) δ_{1/2}(v_i) ]    (21)
Let (c_{θ,v}, P_{θ,v}^+, P_{θ,v}^−) be a measure valued derivative for P_θ in the direction v.
Pass P_θ and (c_{θ,v}, P_{θ,v}^+, P_{θ,v}^−) to the scalar MVD procedure (Algorithm 9) to obtain ∆^MVD_v.
return ∆^MVD_v.

We propose to study this estimator, both in general and as it applies to the random threshold networks. The questions of general interest are the performance characteristics of the estimator, such as the variance, and how the bias converges to zero as the number of iterations increases. We will also study this in the context of optimization.

This estimator will be applied to our motivating example, the random threshold networks. Let us now give the measure valued directional derivatives for the Little model. Fix a parameter $\theta$. A direction in the parameter space is represented as a vector $v \in \mathbb{R}^{n \times n} \times \mathbb{R}^n$, where $v_{i,j}$ is a direction along the weight from $j$ to $i$ and $v_i$ is a direction along the bias at node $i$. After some algebra, one can obtain the following expression for the directional MVD:

$$\partial_\theta \left[ \sum_{x^1} e(x^1)\, P_\theta(x^0, x^1) \right] v = c_{\theta,v}(x^0) \left[ \sum_{x^1} e(x^1)\, P^+_{\theta,v}(x^0, x^1) - \sum_{x^1} e(x^1)\, P^-_{\theta,v}(x^0, x^1) \right],$$

where

$$c_{\theta,v}(x^0) = \sum_{i=1}^n |v_i|\, \sigma(u_i(x^0)) + \sum_{i=1}^n \sum_{j=1}^n |v_{i,j}|\, x^0_j\, \sigma(u_i(x^0)),$$

$$P^+_{\theta,v}(x^0, x^1) = \frac{P_\theta(x^0, x^1)}{c_{\theta,v}(x^0)} \sum_{i=1}^n \left[ x^1_i \left( v_i^- + \sum_{j=1}^n v_{i,j}^-\, x^0_j \right) + \sigma(u_i(x^0)) \left( v_i^+ + \sum_{j=1}^n v_{i,j}^+\, x^0_j \right) \right],$$

and

$$P^-_{\theta,v}(x^0, x^1) = \frac{P_\theta(x^0, x^1)}{c_{\theta,v}(x^0)} \sum_{i=1}^n \left[ x^1_i \left( v_i^+ + \sum_{j=1}^n v_{i,j}^+\, x^0_j \right) + \sigma(u_i(x^0)) \left( v_i^- + \sum_{j=1}^n v_{i,j}^-\, x^0_j \right) \right].$$

This yields two Markov kernels $P^+_{\theta,v}$ and $P^-_{\theta,v}$, which depend not just on the parameter $\theta$ (as would be the case in scalar MVD) but also on the direction of interest $v$.

In order for SPMVD to be practical, there must be a practical procedure for running the Markov kernels $P^+_{\theta,v}$ and $P^-_{\theta,v}$. The random threshold networks operate on a large state space, of size $2^n$ when $n$ nodes are used, so this is not necessarily trivial. While the original Markov kernel $P_\theta$ has a relatively simple structure, each node being updated independently (see Eqn. 17), the same is not necessarily true for the MVD pair $P^+_{\theta,v}$, $P^-_{\theta,v}$. Therefore a significant amount of work may be needed to find efficient sampling methods for $P^+_{\theta,v}$ and $P^-_{\theta,v}$.
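To illustrate the issue, the sketch below computes the rows $P^+_{\theta,v}(x^0, \cdot)$ and $P^-_{\theta,v}(x^0, \cdot)$ exactly by enumerating all $2^n$ successor states. It assumes the logistic $\sigma$ and pre-activations $u_i(x) = \sum_j w_{i,j} x_j + b_i$ (our reading of Eqn. 17, which is not reproduced here), and the function names are ours.

```python
import itertools
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def little_row(W, b, x0):
    """Row P_theta(x0, .) of the Little model kernel: all n nodes are
    updated in parallel and independently (assumed form of Eqn. 17)."""
    n = b.size
    s = sigmoid(W @ x0 + b)  # per-node firing probabilities
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    probs = np.prod(np.where(states == 1, s, 1.0 - s), axis=1)
    return states, probs, s

def mvd_rows(W, b, x0, v_w, v_b):
    """Exact rows of P^+ and P^- in direction (v_w, v_b), by enumerating
    all 2^n successor states; feasible only for small n."""
    states, p, s = little_row(W, b, x0)
    ep = np.maximum(v_b, 0.0) + np.maximum(v_w, 0.0) @ x0    # v_i^+ + sum_j v_ij^+ x0_j
    em = np.maximum(-v_b, 0.0) + np.maximum(-v_w, 0.0) @ x0  # v_i^- + sum_j v_ij^- x0_j
    c = ((ep + em) * s).sum()                                # c_{theta,v}(x0)
    w_plus = p * (states * em + s * ep).sum(axis=1) / c      # P^+(x0, .)
    w_minus = p * (states * ep + s * em).sum(axis=1) / c     # P^-(x0, .)
    return states, w_plus, w_minus, c

# Example: a 3-node network and a random direction.
rng = np.random.default_rng(1)
n = 3
W, b = rng.normal(size=(n, n)), rng.normal(size=n)
x0 = np.array([1, 0, 1])
states, wp, wm, c = mvd_rows(W, b, x0, rng.normal(size=(n, n)), rng.normal(size=n))
print(wp.sum(), wm.sum())  # both rows are probability vectors: prints 1.0 1.0
```

This brute-force construction is only workable for small $n$; for realistic networks one would need a sampling scheme that exploits the structure of $P^+_{\theta,v}$ and $P^-_{\theta,v}$ directly, which is exactly the open problem identified above.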

5 Conclusion and Time Line

Our proposed work involves gradient estimation and optimization for several neural network models that involve feedback. There are both theoretical and applied issues to work on. The problems for deterministic attractor networks are to tune the optimization procedure and to find initial conditions that guarantee convergence. We also plan some experiments to see how these networks perform on real world problems. For stochastic systems, we plan to port the forward sensitivity analysis procedure to this setting. The main work that remains to be done concerns the gradient estimator for the Little model. The properties of the proposed gradient estimator, which combines simultaneous perturbation and measure valued differentiation, will be investigated. This should take three to four months. The SPMVD estimator will also be used in numerical experiments, to see how well it works in practice, on both synthetic and real world data sets.


Bibliography

[1] Geoffrey E. Hinton and Terrence J. Sejnowski. Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1983.

[2] P. Peretto. Collective properties of neural networks: a statistical physics approach. Biological Cybernetics, 50(1):51–62, 1984.

[3] B. Apolloni and D. de Falco. Learning by parallel Boltzmann machines. IEEE Transactions on Information Theory, 37(4):1162–1165, July 1991.

[4] Bruno Apolloni and Diego de Falco. Learning by asymmetric parallel Boltzmann machines. Neural Computation, 3(3):402–408, 1991.

[5] Brendan J. Frey and Geoffrey E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193–213, January 1999.

[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[7] William A. Little. The existence of persistent states in the brain. Mathematical Biosciences, 19(1-2):101–120, 1974.

[8] Radford M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.

[9] Luis B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE First International Conference on Neural Networks, San Diego, California, 1987. IEEE, New York.

[10] Fernando J. Pineda. Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59:2229–2232, November 1987.

[11] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.


[12] Patrik Floréen and Pekka Orponen. On the computational complexity of analyzing Hopfield nets. Complex Systems, 3(6):577–587, 1989.

[13] Amir F. Atiya. Learning on a general network. In D. Z. Anderson, editor, Neural Information Processing Systems, pages 22–30. American Institute of Physics, 1988.

[14] Fernando J. Pineda. Dynamics and architecture for neural computation. Journal of Complexity, 4(3):216–245, 1988.

[15] Fernando J. Pineda. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Computation, 1(2):161–172, 1989.

[16] Pierre Baldi. Gradient descent learning algorithm overview: A general dynamical systems perspective. IEEE Transactions on Neural Networks, 6(1):182–195, 1995.

[17] B. A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: a survey. IEEE Transactions on Neural Networks, 6(5):1212–1228, September 1995.

[18] Arthur Earl Bryson. Applied optimal control: optimization, estimation and control. CRC Press, 1975.

[19] A. Griewank and A. Walther. Evaluating Derivatives. Society for Industrial and Applied Mathematics, second edition, 2008.

[20] Michael B. Giles and Niles A. Pierce. An introduction to the adjoint approach to design. Flow, Turbulence and Combustion, 65(3-4):393–415, 2000.

[21] Adel Hamdi and Andreas Griewank. Reduced quasi-Newton method for simultaneous design and optimization. Computational Optimization and Applications, 49(3):521–548, 2011.

[22] Ricardo Riaza and Pedro J. Zufiria. Time-scaling in recurrent neural learning. In José R. Dorronsoro, editor, Artificial Neural Networks - ICANN 2002, volume 2415 of Lecture Notes in Computer Science, pages 1371–1376. Springer Berlin Heidelberg, 2002.

[23] Ali Saberi and Hassan Khalil. Quadratic-type Lyapunov functions for singularly perturbed systems. IEEE Transactions on Automatic Control, 29(6):542–550, 1984.

[24] Ricardo Riaza and Pedro J. Zufiria. Differential-algebraic equations and singular perturbation methods in recurrent neural learning. Dynamical Systems: An International Journal, 18(1):89–105, 2003.

[25] Dawei W. Dong and John J. Hopfield. Dynamic properties of neural networks with adapting synapses. Network: Computation in Neural Systems, 3(3):267–283, 1992.

[26] Anke Meyer-Baese, Frank Ohl, and Henning Scheich. Singular perturbation analysis of competitive neural networks with different time scales. Neural Computation, 8(8):1731–1742, 1996.


[27] Andreas Griewank. Projected Hessians for Preconditioning in One-Step One-Shot Design Optimization, pages 151–171. Springer US, Boston, MA, 2006.

[28] Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.

[29] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.

[30] David Steinsaltz. Locally contractive iterated function systems. Annals of Probability, 27(4):1952–1979, October 1999.

[31] Örjan Stenflo. A survey of average contractive iterated function systems. Journal of Difference Equations and Applications, 18(8):1355–1380, 2012.

[32] M. F. Barnsley and S. Demko. Iterated function systems and the global construction of fractals. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 399(1817):243–275, 1985.

[33] Michael F. Barnsley and John H. Elton. A new class of Markov processes for image encoding. Advances in Applied Probability, 20(1):14–32, 1988.

[34] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.

[35] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

[36] James C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.

[37] Bernd Heidergott, Felisa J. Vázquez-Abad, Georg Pflug, and Taoying Farenhorst-Yuan. Gradient estimation for discrete-event systems by measure-valued differentiation. ACM Transactions on Modeling and Computer Simulation, 20(1):5:1–5:28, February 2010.

[38] Georg Ch. Pflug. Gradient estimates for the performance of Markov chains and discrete event processes. Annals of Operations Research, 39(1):173–194, 1992.

[39] Georg Ch. Pflug. Optimization of stochastic models: The interface between simulation and optimization, volume 373. Kluwer Academic Publishers, 1996.

[40] Bernd Heidergott and Arie Hordijk. Taylor series expansions for stationary Markov chains. Advances in Applied Probability, pages 1046–1070, 2003.
