
Neural Networks II

Robert Jacobs

Department of Brain & Cognitive Sciences
University of Rochester

October 26, 2015

Nonlinear Activation Functions

(1) Threshold activation function

y = 1 if w^T x > 0, and y = 0 otherwise

[Figure: threshold activation function, a step from 0 to 1 at w^T x = 0]

Inspired by neural “all or none” principle

Example: AND function

x1  x2  y*
 0   0   0
 0   1   0
 1   0   0
 1   1   1

[Figure: AND in the (x1, x2) plane; the single positive case (1, 1) can be separated from the other three cases by a line]
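To make this concrete, here is a minimal sketch of a single threshold unit computing AND (Python/NumPy; the particular weights w1 = w2 = 1 and bias θ = −1.5 are illustrative choices, not values given in the slides):

```python
import numpy as np

def threshold_unit(x, w, theta):
    """Threshold activation: output 1 if w^T x + theta > 0, else 0
    (theta plays the role of the bias weight)."""
    return int(np.dot(w, x) + theta > 0)

# Illustrative weights for AND: both inputs must be on to overcome the bias.
w_and = np.array([1.0, 1.0])
theta_and = -1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        y = threshold_unit(np.array([x1, x2]), w_and, theta_and)
        print(x1, x2, "->", y)   # matches the y* column of the AND table
# (With the same weights and theta = -0.5, the unit computes OR instead.)
```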

Example: OR function

x1  x2  y*
 0   0   0
 0   1   1
 1   0   1
 1   1   1

[Figure: OR in the (x1, x2) plane; the single negative case (0, 0) can be separated from the other three cases by a line]

Example: Exclusive-OR (XOR)

x1  x2  y*
 0   0   0
 0   1   1
 1   0   1
 1   1   0

[Figure: XOR in the (x1, x2) plane; the two classes lie on opposite diagonals, and no single line separates them]

Let θ = bias weight

(0, 0) → 0 means that θ < 0
(0, 1) → 1 means that w2 + θ > 0
(1, 0) → 1 means that w1 + θ > 0
(1, 1) → 0 means that w1 + w2 + θ < 0

⇒ Not possible to satisfy all of these constraints (adding the second and third gives w1 + w2 + 2θ > 0, which together with θ < 0 forces w1 + w2 + θ > 0, contradicting the fourth)

But, but, but...what about:

x1  x2  x1×x2  y*
 0   0    0     0
 0   1    0     1
 1   0    0     1
 1   1    1     0

XOR

⇒ New representation formed from original variables makes XOR task performable!!!
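As a quick check of this claim, here is a sketch of a single threshold unit over the expanded input (x1, x2, x1×x2) computing XOR (the weights, including the −2 weight on the product feature, are hand-picked for illustration):

```python
import numpy as np

def threshold_unit(x, w, theta):
    return int(np.dot(w, x) + theta > 0)

# Expanded representation: append the product feature x1*x2.
# Illustrative weights: w1 = w2 = 1, a strongly negative weight on x1*x2,
# and bias theta = -0.5.
w_xor = np.array([1.0, 1.0, -2.0])
theta_xor = -0.5

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([x1, x2, x1 * x2])
        print(x1, x2, "->", threshold_unit(x, w_xor, theta_xor))
# Output matches the y* column of the XOR table.
```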

Q: In general, how do we find new representations that are useful for a given task?

(2) Logistic activation function

y = 1 / (1 + exp[−w^T x])

“Soft” approximation to threshold activation function
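A quick numerical illustration of the "soft" approximation (the particular z values and weight scalings below are arbitrary): as the weights are scaled up, the logistic output approaches the hard 0/1 threshold.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])   # values of w^T x
for scale in (1, 5, 25):                     # scaling the weights scales z
    print(scale, np.round(logistic(scale * z), 3))
# As the scale grows, outputs approach 0 for negative z and 1 for positive z,
# i.e. the threshold activation function.
```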

Relation to mixture model with two normal components

- Data: {x_t}, t = 1, ..., T
- Assume x_t came from one of two normal distributions
- Assume the distributions have equal variances (σ1² = σ2² = 1)
- Assume the distributions have equal priors (π1 = π2 = 0.5)

Recall:

p(i | x_t) = p(x_t | i) π_i / p(x_t)
           = p(x_t | i) π_i / Σ_{j=1}^{2} p(x_t | j) π_j

p(1 | x_t) = (1/√(2π)) exp[−(1/2)(x_t − µ1)²] π1 / { (1/√(2π)) exp[−(1/2)(x_t − µ1)²] π1 + (1/√(2π)) exp[−(1/2)(x_t − µ2)²] π2 }

           = exp[−(1/2)(x_t − µ1)²] / { exp[−(1/2)(x_t − µ1)²] + exp[−(1/2)(x_t − µ2)²] }
             (the 1/√(2π) factors cancel, and so do the priors since π1 = π2)

           = 1 / { 1 + exp[−(1/2)(x_t − µ2)²] / exp[−(1/2)(x_t − µ1)²] }

           = 1 / { 1 + exp[−(1/2)[(x_t − µ2)² − (x_t − µ1)²]] }

           = 1 / { 1 + exp[−(1/2)[squared distance from µ2 − squared distance from µ1]] }

Let w^T x_t = (1/2)[(x_t − µ2)² − (x_t − µ1)²] (the quadratic terms cancel, leaving a function that is linear in x_t plus a bias)

The logistic function gives the posterior probability that the input x_t came from one of the two components (here, component 1)

- If w^T x = 0 (y = 0.5), x_t is equally likely to belong to either component
- If w^T x > 0 (y > 0.5), x_t is more likely to belong to component 1 (it is closer to µ1)
- If w^T x < 0 (y < 0.5), x_t is more likely to belong to component 2 (it is closer to µ2)

⇒ Logistic units are binary feature detectors

- Relation to Bernoulli distribution (y = probability of some binary event)
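The algebra above can be checked numerically. A minimal sketch (the means µ1 and µ2 are arbitrary illustrative values): the posterior computed by Bayes' rule equals the logistic of w·x_t + b with w = µ1 − µ2 and b = (µ2² − µ1²)/2, which is what (1/2)[(x_t − µ2)² − (x_t − µ1)²] expands to.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two unit-variance normal components with equal priors (pi1 = pi2 = 0.5).
mu1, mu2 = -1.0, 2.0          # arbitrary illustrative means

def posterior_1(x):
    """p(1 | x) by Bayes' rule (the 1/sqrt(2*pi) factors and priors cancel)."""
    n1 = np.exp(-0.5 * (x - mu1) ** 2)
    n2 = np.exp(-0.5 * (x - mu2) ** 2)
    return n1 / (n1 + n2)

# Linear function whose logistic equals the posterior.
w = mu1 - mu2
b = 0.5 * (mu2 ** 2 - mu1 ** 2)

x = np.linspace(-4, 5, 7)
print(np.allclose(posterior_1(x), logistic(w * x + b)))   # -> True
```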

(3) Softmax activation function

y_i = exp[w_i^T x] / Σ_j exp[w_j^T x]

Because y_i ∈ (0, 1) and Σ_i y_i = 1, the {y_i} form a multinomial probability distribution

Relation to “soft” winner-take-all

Relation to logistic:

exp[w1^T x] / (exp[w1^T x] + exp[w2^T x]) = 1 / (1 + exp[w2^T x − w1^T x]) = 1 / (1 + exp[−(w1 − w2)^T x])

Two classes → use logistic
More than two classes → use softmax
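A short numerical check of this relation (the input and weight vectors below are random and purely illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
w1, w2 = rng.normal(size=3), rng.normal(size=3)

y = softmax(np.array([w1 @ x, w2 @ x]))
print(np.isclose(y[0], logistic((w1 - w2) @ x)))   # True: two-class softmax = logistic
print(np.isclose(y.sum(), 1.0))                    # outputs form a probability distribution
```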

Multilayer Neural Networks

Hidden units: use logistic activation function
Output units: use linear, logistic, or softmax function depending on the nature of the task

Q: Why are hidden units nonlinear?
A: A linear function composed with another linear function results in a linear function

h1 = a1x1 + b1x2 + c1

h2 = a2x1 + b2x2 + c2

y = dh1 + eh2 + f

y = d(a1x1 + b1x2 + c1) + e(a2x1 + b2x2 + c2) + f

= da1x1 + db1x2 + dc1 + ea2x1 + eb2x2 + ec2 + f

Let g = da1 + ea2 , h = db1 + eb2, and k = dc1 + ec2 + f

Then y = gx1 + hx2 + k
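A tiny numerical check of this collapse (the coefficients below are arbitrary):

```python
import numpy as np

# Arbitrary coefficients for the two "hidden units" and the output unit.
a1, b1, c1 = 0.3, -1.2, 0.5
a2, b2, c2 = 2.0, 0.7, -0.4
d, e, f = 1.5, -0.8, 0.1

def two_layer_linear(x1, x2):
    h1 = a1 * x1 + b1 * x2 + c1
    h2 = a2 * x1 + b2 * x2 + c2
    return d * h1 + e * h2 + f

# Collapsed single linear function with g, h, k as defined in the text.
g, h, k = d * a1 + e * a2, d * b1 + e * b2, d * c1 + e * c2 + f

x1, x2 = 0.9, -2.3
print(np.isclose(two_layer_linear(x1, x2), g * x1 + h * x2 + k))   # -> True
```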

Exception: Neural Networks and PCA

Auto-associator: Instance when one might want linear hidden units

[Figure: auto-associator network with 65,536 input units, 7 hidden units, and 65,536 output units]

Relation between neural networks and PCA: After training, the space spanned by the 7 hidden units' weight vectors is the same as the space spanned by the 7 eigenvectors with the largest eigenvalues of the data covariance matrix

Q: Is this supervised or unsupervised training?
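A small sketch of this relation (not the 65,536-unit network above; the data dimensions, learning rate, and iteration count are arbitrary choices): a linear auto-associator trained by gradient descent to reproduce its input should end up with hidden weight vectors spanning approximately the same subspace as the top principal components.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 10-dimensional inputs that mostly live in a 3-D subspace.
N, d, k = 500, 10, 3
Z = rng.normal(size=(N, k))
M = rng.normal(size=(k, d)) / np.sqrt(d)
X = Z @ M + 0.01 * rng.normal(size=(N, d))
X = X - X.mean(axis=0)

# Linear auto-associator: hidden = X W1^T, output = hidden W2^T, target = input.
W1 = 0.1 * rng.normal(size=(k, d))
W2 = 0.1 * rng.normal(size=(d, k))
lr = 0.05
for _ in range(5000):
    H = X @ W1.T
    E = X - H @ W2.T                       # reconstruction error
    grad_W2 = (2 / N) * E.T @ H            # gradient ascent on -error
    grad_W1 = (2 / N) * W2.T @ E.T @ X
    W2 += lr * grad_W2
    W1 += lr * grad_W1

# Top-k principal subspace from the data covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
Q_pca = eigvecs[:, -k:]                    # columns: top-k eigenvectors

# Compare subspaces: the cosines of the principal angles should all be near 1.
Q_enc, _ = np.linalg.qr(W1.T)              # orthonormal basis of hidden weight space
print(np.round(np.linalg.svd(Q_pca.T @ Q_enc, compute_uv=False), 3))
```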

(Nonlinear) Neural Networks are Universal Approximators

[Figure: network with a layer of input units, one layer of hidden units, and a single output unit]

For any continuous function (on a bounded input domain), there exists a neural network with exactly one layer of hidden units (hidden units use the logistic activation function; the output unit is linear) that approximates that function arbitrarily closely

Theorem does not say how many hidden units
Theorem does not say how to find the weights
This is a theorem about representation, not learning

Method for Training Nonlinear Neural Networks

Backpropagation Algorithm (see next class)

Alternative Method for Training Nonlinear Neural Networks

Step 1: Run clustering algorithm with a large number of components

- Each mixture component is a hidden unit
- Posterior probability of a component (given an input) is the hidden unit's activation
- Provides a (nonlinear) "covering" of the space of inputs

Step 2: Run LMS algorithm to learn weights from hidden units to output unit

⇒ Relation to “kernel based” methods
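A compact sketch of this two-step recipe on a toy one-dimensional regression problem (the data set, the number of components, and the width s are arbitrary choices; a hand-rolled k-means stands in for the clustering step, and the hidden-to-output weights are found by least squares, the solution LMS would converge to):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 1-D regression task: noisy sine.
X = rng.uniform(-3, 3, size=200)
Y = np.sin(X) + 0.1 * rng.normal(size=200)

# Step 1: simple k-means to place K components over the input space.
K, s = 10, 0.5                           # number of hidden units, fixed width
centers = rng.choice(X, size=K, replace=False)
for _ in range(50):
    assign = np.argmin((X[:, None] - centers[None, :]) ** 2, axis=1)
    for k in range(K):
        if np.any(assign == k):
            centers[k] = X[assign == k].mean()

# Hidden unit activations: normalized Gaussian responsibilities
# (posterior probability of each component given the input).
def hidden(x):
    a = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / s) ** 2)
    return a / a.sum(axis=1, keepdims=True)

# Step 2: solve for hidden-to-output weights by least squares.
H = np.hstack([hidden(X), np.ones((len(X), 1))])      # add a bias unit
w, *_ = np.linalg.lstsq(H, Y, rcond=None)

X_test = np.linspace(-3, 3, 5)
print(np.round(np.hstack([hidden(X_test), np.ones((5, 1))]) @ w, 2))
print(np.round(np.sin(X_test), 2))     # predictions should be roughly similar
```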

Yet Another Alternative Method for Training Nonlinear Neural Networks

Linear Basis Function Models:

y = w^T φ(x) = Σ_i w_i φ_i(x)

where the φ_i(x) are fixed functions known as basis functions.

When the φ_i(x) are nonlinear basis functions, then the output y is a nonlinear function of the input x.

Polynomial regression: φ_i(x) = x^i (assuming a single input x)

φ_0(x) = 1
φ_1(x) = x
φ_2(x) = x²
φ_3(x) = x³

y = Σ_i w_i φ_i(x)

Gaussian basis functions:

φ_i(x) = exp[−(x − µ_i)² / (2s²)]

Sigmoidal basis functions:

φ_i(x) = 1 / (1 + exp[(x − µ_i)/s])
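A brief sketch comparing the three kinds of basis functions on a toy one-dimensional regression problem (the data, the number and placement of basis functions, and the width s are arbitrary; the weights are fit by least squares):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=100)
y = np.sin(2 * x) + 0.1 * rng.normal(size=100)

centers = np.linspace(-2, 2, 8)          # arbitrary basis-function locations
s = 0.5                                   # arbitrary width / slope parameter

def design(x, kind):
    """Design matrix Phi with one column per basis function phi_i(x)."""
    if kind == "polynomial":
        return np.column_stack([x ** i for i in range(8)])        # phi_i(x) = x^i
    if kind == "gaussian":
        return np.exp(-(x[:, None] - centers) ** 2 / (2 * s ** 2))
    if kind == "sigmoidal":
        return 1.0 / (1.0 + np.exp((x[:, None] - centers) / s))
    raise ValueError(kind)

for kind in ("polynomial", "gaussian", "sigmoidal"):
    Phi = design(x, kind)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # y = sum_i w_i phi_i(x)
    err = np.mean((Phi @ w - y) ** 2)
    print(kind, round(float(err), 4))                 # all three fit the toy data
```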

Neural Network Applications (networks trained with the backpropagation algorithm)

Rueckl, J. G., Cave, K. R., & Kosslyn, S. M. (1989). Why are “what” and “where” processed by separate cortical visual systems? A computational investigation. Journal of Cognitive Neuroscience, 1, 171-186.

In the primate visual system:
- object identification
- spatial localization

are performed by different cortical pathways.

Compared two systems:
- One network for both tasks
- Two networks, one for each task

Results: Two-network system better
- learns faster
- representation more interpretable

Sejnowski, T. J. & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

S → NP VP “.”
NP → PropN | N | N RC
VP → V (NP)
RC → who NP VP | who VP (NP)
N → boy | girl | cat | boys | girls | cats
PropN → John | Mary
V → chase | feed | see | chases | feeds | sees

Language has a componential structure

Learning language: Determine the componential structure from the sentences of the language

Rules coordinate dependencies among words:

- John chases cats.
- Boys chase cats.

⇒ Requires memory
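To make the grammar concrete, here is a small sketch of a random sentence generator for it (the expansion choices are made with arbitrary uniform probabilities, and a depth limit keeps the recursion through RC bounded; note that these rules alone do not enforce the subject-verb agreement just illustrated, which is exactly the dependency that requires memory of earlier words):

```python
import random

# Production rules of the grammar listed above.  Optional constituents
# (parentheses in the grammar) are written out as separate alternatives.
RULES = {
    "S":     [["NP", "VP", "."]],
    "NP":    [["PropN"], ["N"], ["N", "RC"]],
    "VP":    [["V"], ["V", "NP"]],
    "RC":    [["who", "NP", "VP"], ["who", "VP"], ["who", "VP", "NP"]],
    "N":     [["boy"], ["girl"], ["cat"], ["boys"], ["girls"], ["cats"]],
    "PropN": [["John"], ["Mary"]],
    "V":     [["chase"], ["feed"], ["see"], ["chases"], ["feeds"], ["sees"]],
}

def expand(symbol, depth=0):
    if symbol not in RULES:
        return [symbol]                       # terminal word
    options = RULES[symbol]
    if depth > 3:                             # limit recursion through RC
        options = [o for o in options if "RC" not in o] or options
    out = []
    for s in random.choice(options):
        out.extend(expand(s, depth + 1))
    return out

random.seed(0)
for _ in range(3):
    print(" ".join(expand("S")))
```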

Zipser, D. & Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679-684.

Hinton, G. E., Plaut, D. C. & Shallice, T. (1993). Simulating brain damage. Scientific American, 269, 76-82.

Deep dyslexia:
- (1) Visual errors:
  - stock → “shock”
  - crowd → “crown”
- (2) Semantic errors:
  - symphony → “orchestra”
  - uncle → “nephew”
- (3) Visual and semantic errors:
  - sympathy → “orchestra”
  - cat → “bed”