
Neural network


Neural network (connectionist) architectures are not meant to duplicate the operation of the human brain, but rather to draw inspiration from known facts about how the brain works. They

are characterized by having:

A large number of very simple, neuron-like processing elements.

A large number of weighted connections between the elements. The weights on the

connections encode the knowledge of a network.

Highly parallel, distributed control.

An emphasis on learning internal representations automatically.

Hopfield Networks:

Neural network researchers contend that there is a basic mismatch between standard computer

information processing technology and the technology used by the brain.

Hopfield introduced a neural network that he proposed as a theory of memory. A Hopfield

network has the following interesting features:

Distributed Representation – A memory is stored as a pattern of activation across a set of

processing elements. Furthermore, memories can be superimposed on one another;

different memories are represented by different patterns over the same set of processing

elements.

Distributed, Asynchronous Control – Each processing element makes decisions based

only on its own local situation. All these local actions add up to a global solution.

Content-Addressable Memory – A number of patterns can be stored in a network. To

retrieve a pattern, we need only specify a portion of it. The network automatically finds

the closest match.

Fault Tolerance – If a few processing elements misbehave or fail completely, the network

will still function properly.

A simple Hopfield net is shown in figure 1. Processing elements, or units, are always in one of

two states, active or inactive. In the figure, units colored black are active and units colored white


are inactive. Units are connected to each other with weighted, symmetric connections. A

positively weighted connection indicates that the units tend to activate each other. A negative

connection allows an active unit to deactivate a neighboring unit.

The network operates as follows. A random unit is chosen. If any of its neighbors are active, the

unit computes the sum of the weights on the connections to those active neighbors. If the sum is

positive, the unit becomes active, otherwise it becomes inactive. Another random unit is chosen,

and the process repeats until the network reaches a stable state, i.e., until no more units can

change state. This process is called parallel relaxation. If the network starts in the state shown in

figure 1, the unit in the lower left corner will tend to activate the unit above it. This unit, in turn,

will attempt to activate the unit above it, but the inhibitory connection from the upper-right unit

will foil this attempt, and so on.
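A minimal Python sketch of this parallel relaxation procedure follows; the weight matrix and starting state are made-up values for illustration, not the network of figure 1.

import random

# Symmetric weights for a small, made-up Hopfield net (not the net in figure 1).
# weights[i][j] is the weight on the connection between unit i and unit j.
weights = [
    [ 0,  1, -1,  0],
    [ 1,  0,  1, -2],
    [-1,  1,  0,  1],
    [ 0, -2,  1,  0],
]

state = [1, 0, 0, 1]  # 1 = active, 0 = inactive

def parallel_relaxation(weights, state, steps=100):
    """Repeatedly pick a random unit and update it from its active neighbours."""
    n = len(state)
    for _ in range(steps):
        i = random.randrange(n)
        active = [j for j in range(n) if j != i and state[j] == 1]
        if not active:
            continue  # no active neighbours: the unit keeps its current state
        total = sum(weights[i][j] for j in active)
        state[i] = 1 if total > 0 else 0  # positive sum -> active, otherwise inactive
    return state  # a fixed step count stands in for "until no more units change state"

print(parallel_relaxation(weights, state))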

This network has only four distinct stable states, which are shown in figure 2. Given any initial

state, the network will necessarily settle into one of these four configurations. The stable state in

which all units are inactive can only be reached if it is also the initial state. The network can be

thought of as storing the patterns in figure 2. Hopfield’s major contribution was to show that

given any set of weights and any initial state, his parallel relaxation algorithm would eventually

steer the network into a stable state. There can be no divergence or oscillation.

The network can be used as a content-addressable memory by setting the activities of the units to

correspond to a partial pattern. To retrieve a pattern, we need only supply a portion of it. The

network will then settle into the stable state that best matches the partial pattern. An example is

shown in figure 3.

Parallel relaxation is nothing more than search. It is useful to think of the various states of a

network as forming a search space, as in figure 4. A randomly chosen state will transform itself

ultimately into one of the local minima, namely the nearest stable state. This is how we get the

content-addressable behavior. In figure 4, state B is depicted as being lower than state A because

fewer constraints are violated. A constraint is violated, for example, when two active units are

connected by a negatively weighted connection.
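As a rough illustration of what counts as a violated constraint, the snippet below counts the kind of violation just mentioned (two active units joined by a negatively weighted connection); the weight matrix and state follow the same made-up conventions as the earlier sketch.

def violated_constraints(weights, state):
    """Count pairs of active units that are connected by a negative weight."""
    n = len(state)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if state[i] == 1 and state[j] == 1 and weights[i][j] < 0
    )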


Content-Addressable Memory

a function or map

can learn some patterns (e.g. image of letter “J”) and given a partial pattern (slightly

scrambled image of “J”) will reproduce the nearest learnt pattern

Hopfield network has units that are either active or inactive

weighted, symmetric connections between units

a pattern can be described as some particular combination of active units in a state (each

unit has a unique name or identifier)

learning a pattern = adjusting the weights

recalling a pattern = parallel relaxation algorithm
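The notes do not say how the weights are adjusted; one common choice (an assumption here, not something stated above) is the Hebbian rule, sketched below with unit states taking values +1/-1 for convenience, with recall done by parallel relaxation as described earlier.

import random

def learn(patterns):
    """Build a symmetric weight matrix from a list of patterns (lists of +1/-1)."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]  # units active together reinforce each other
    return weights

def recall(weights, partial, steps=100):
    """Parallel relaxation starting from a partial or corrupted pattern."""
    state = list(partial)
    n = len(state)
    for _ in range(steps):
        i = random.randrange(n)
        total = sum(weights[i][j] * state[j] for j in range(n) if j != i)
        state[i] = 1 if total >= 0 else -1
    return state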


parallel relaxation algorithm: boulder-valley analogy

o we have a boulder rolling down a valley

o the bottom of the valley represents the pattern that the Hopfield net has learnt

o wherever the boulder is initially placed, it will roll towards the nearest local

minimum – this represents the Hopfield net iteratively processing the next

network state

o the boulder will eventually stop rolling at the bottom of the valley – this

represents the stable state of the Hopfield network

o our landscape can have multiple valleys – this represents the Hopfield network

learning multiple patterns; note “crafting the landscape” represents “learning”

o depending on where the boulder is initially placed it will roll towards the nearest

valley – given some partial pattern, the Hopfield network (using the parallel

relaxation algorithm) will eventually stabilize at the closest matching pattern


Perceptrons

perceptron “learning” means adjusting the weights w0…wn

activation function

o each unit (neuron) has n incoming connections (x1…xn) each with weights

(w1…wn)

o by default input x0 = 1, meaning that w0 always applies – this acts as the threshold, as follows:

[Figure: worked example tracing the value computed at each step – initial: -2, step 1: +3, step 2: +3, step 3: -1, step 4: -3, step 5: +3, step 6: continue until stable]


o implicit threshold: the sum of all inputs must be greater than or equal to 0

e.g. for 2 inputs: w0 + w1x1 + w2x2 ≥ 0

o explicit threshold: the sum of all inputs must be greater than or equal to value t

(threshold)

note that we sum from i=1 to n (rather than from i=0)

So what about w0? If we set w0 = -t, then it's the same as summing from i=0:

w1x1 + w2x2 ≥ t (here's the explicit threshold)

-t + w1x1 + w2x2 ≥ 0 (move t to the left-hand side)

w0 + w1x1 + w2x2 ≥ 0 (with w0 = -t, we're back to an implicit threshold)
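A minimal Python sketch of this activation rule, using the x0 = 1 convention so that w0 absorbs the threshold; the AND weights are just an illustrative choice, not something from the notes.

def perceptron_output(weights, inputs):
    """weights = [w0, w1, ..., wn], inputs = [x1, ..., xn]; returns 1 if the unit fires."""
    xs = [1] + list(inputs)  # prepend x0 = 1 so that w0 acts as the threshold term
    total = sum(w * x for w, x in zip(weights, xs))
    return 1 if total >= 0 else 0

# Example: with w0 = -1.5, w1 = 1, w2 = 1 the unit computes logical AND of two 0/1 inputs.
print(perceptron_output([-1.5, 1, 1], [1, 1]))  # 1
print(perceptron_output([-1.5, 1, 1], [1, 0]))  # 0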

perceptron can only represent linearly separable functions

o e.g. for two inputs (two dimensional system) it can separate the black and white

dots because it can draw a line through the graph that divides them

o line equation:

y = mx + b

o perceptron activation function as a line equation:

x2 = -(w1/w2)x1 - w0/w2 (i.e. slope m = -w1/w2 and intercept b = -w0/w2)
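A quick numerical check of this rewriting, with arbitrary example weights:

w0, w1, w2 = -1.5, 1.0, 1.0  # arbitrary example weights

# Boundary w0 + w1*x1 + w2*x2 = 0 rewritten as x2 = m*x1 + b.
m = -w1 / w2
b = -w0 / w2
print(f"x2 = {m} * x1 + {b}")  # with w2 > 0, dots above this line make the unit fire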


o note that this applies to any number of dimensions (any i≥1 used as xi )

e.g. 1-dimensional system might look like:

where the perceptron needs to find a good point (-w0 / w1) that splits the

white and black dots

e.g. 3-dimensional system might look like:

…where the perceptron needs to find a good plane

o perceptrons can’t learn functions that are not linearly separable

e.g. XOR:

Let x1 and x2 only take values of either 0 or 1

if x1=x2 then dot is white (represents Boolean false)

if x1≠x2 then dot is black (represents Boolean true)


we can’t draw a straight line through this graph that separates the white

dots from the black dots

e.g. same problem in 1D as well:

we can’t choose a single point that will separate the black and white dots

o solution: multilayer networks

e.g. multilayer network with one hidden layer:

o Why does this solve the problem?

hidden layers increase the hypothesis space that our neural network can

represent

think of each hidden unit as a perceptron that draws a line through our 2D

graph (to classify dots as black or white)

so now our output unit can take the combination of these lines

e.g. consider XOR function again

Let x1 and x2 only take values of either 0 or 1

units a1 and a2 represent inputs x1 and x2 respectively


perceptron a3: a1 + a2 ≥ 0.5

perceptron a4: a1 + a2 ≥ 1.5

the perceptron will give a “1” (on) if the input (dot) lies on the side of the

line that has a big red “1”

now we’ll combine a3 and a4 to capture the XOR function

we want a3=1, a4=0 to give a “1” in our network

note that no dot can satisfy a3=0, a4=1 (compare the graphs above)

the complete network is on the right

perceptron a5: a3 - a4 ≥ 0.5

Complete multilayer feed-forward neural net
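The small Python check below runs the network just constructed (a3: a1 + a2 ≥ 0.5, a4: a1 + a2 ≥ 1.5, a5: a3 - a4 ≥ 0.5) over all four inputs and reproduces the XOR truth table.

def step(total, threshold):
    """Threshold unit: fires (returns 1) when total >= threshold."""
    return 1 if total >= threshold else 0

def xor_net(x1, x2):
    a3 = step(x1 + x2, 0.5)   # fires when at least one input is on
    a4 = step(x1 + x2, 1.5)   # fires only when both inputs are on
    a5 = step(a3 - a4, 0.5)   # fires when a3 is on and a4 is off
    return a5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))  # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0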


note that there are many ways (infinitely many, in fact) to implement XOR in a neural network

the error values at the “hidden” layer are sort of mysterious because the

training data only gives values for the input and the desired output

(doesn’t specify what values the hidden layers should take)

“learning” in multilayer feed-forward neural nets uses the back-propagation algorithm

the error from the output layer is propagated back through to the

hidden layers (assess blame by dividing the error up among the

contributing weights)
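A very small sketch of one back-propagation update for a 2-input, 2-hidden-unit, 1-output network; the sigmoid activation, squared-error loss, and learning rate are standard choices assumed here for illustration, not details given in these notes.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One back-propagation update for a 2-input, 2-hidden-unit, 1-output network.
    w_hidden[j] = [w0, w1, w2] for hidden unit j; w_out = [w0, w1, w2]."""
    # Forward pass (x0 = 1 bias convention, as with the perceptron sketch above).
    xs = [1] + list(x)
    h = [sigmoid(sum(w * xi for w, xi in zip(w_hidden[j], xs))) for j in range(2)]
    hs = [1] + h
    out = sigmoid(sum(w * hi for w, hi in zip(w_out, hs)))

    # Error at the output unit.
    delta_out = (target - out) * out * (1 - out)

    # Propagate the error back: each hidden unit's share of the blame is
    # proportional to the weight connecting it to the output unit.
    delta_hidden = [h[j] * (1 - h[j]) * w_out[j + 1] * delta_out for j in range(2)]

    # Adjust the weights in proportion to the propagated errors.
    for i in range(3):
        w_out[i] += lr * delta_out * hs[i]
    for j in range(2):
        for i in range(3):
            w_hidden[j][i] += lr * delta_hidden[j] * xs[i]

# Hypothetical usage: one update on an XOR training example.
w_hidden = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]
w_out = [0.05, 0.4, -0.3]
train_step((1, 0), 1, w_hidden, w_out)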