Neural Networks II
Robert Jacobs
Department of Brain & Cognitive Sciences
University of Rochester
October 26, 2015
Nonlinear Activation Functions
(1) Threshold activation function
\[
y = \begin{cases} 1 & \text{if } \vec{w}^T\vec{x} > 0 \\ 0 & \text{otherwise} \end{cases}
\]
[Plot: step function jumping from 0 to 1 at $\vec{w}^T\vec{x} = 0$]
Inspired by neural “all or none” principle
Example: Exclusive-OR (XOR)
x1  x2  y*
0   0   0
0   1   1
1   0   1
1   1   0
[Plot: the four XOR inputs in the $(x_1, x_2)$ plane; the two classes (O and X) lie on opposite diagonals, so no single line separates them]
Let θ = bias weight
$(0, 0) \to 0$ means that $\theta < 0$
$(0, 1) \to 1$ means that $w_2 + \theta > 0$
$(1, 0) \to 1$ means that $w_1 + \theta > 0$
$(1, 1) \to 0$ means that $w_1 + w_2 + \theta < 0$
⇒ Not possible to satisfy all of these constraints
But, but, but...what about:
x1  x2  x1×x2  y*
0   0   0      0
0   1   0      1
1   0   0      1
1   1   1      0
⇒ A new representation formed from the original variables makes the XOR task solvable!!!
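To make this concrete, here is a minimal sketch (Python/NumPy) of a single threshold unit that computes XOR once the product feature $x_1 x_2$ is appended to the input. The weights are hand-picked for illustration; they are one workable choice, not a unique solution.

```python
import numpy as np

def threshold_unit(w, x):
    """Threshold activation: 1 if w^T x > 0, else 0."""
    return 1 if w @ x > 0 else 0

# Input vector is [x1, x2, x1*x2, 1]; the final 1 carries the bias weight.
# These weights are hand-picked (one workable choice among many).
w = np.array([1.0, 1.0, -2.0, -0.5])

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2, x1 * x2, 1.0])
    print(x1, x2, threshold_unit(w, x))   # prints 0, 1, 1, 0 -- XOR
```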
Q: In general, how do we find new representations that are useful for a given task?
(2) Logistic activation function
\[
y = \frac{1}{1 + \exp[-\vec{w}^T\vec{x}]}
\]
“Soft” approximation to threshold activation function
Relation to mixture model with two normal components
- Data: $\{x_t\}_{t=1}^T$
- Assume $x_t$ came from one of two normal distributions
- Assume the distributions have equal variances ($\sigma_1^2 = \sigma_2^2 = 1$)
- Assume the distributions have equal priors ($\pi_1 = \pi_2 = 0.5$)
Recall:
\[
p(i|x_t) = \frac{p(x_t|i)\,\pi_i}{p(x_t)} = \frac{p(x_t|i)\,\pi_i}{\sum_{j=1}^{2} p(x_t|j)\,\pi_j}
\]
\begin{align*}
p(1|x_t) &= \frac{\frac{1}{\sqrt{2\pi}}\exp[-\frac{1}{2}(x_t-\mu_1)^2]\,\pi_1}{\frac{1}{\sqrt{2\pi}}\exp[-\frac{1}{2}(x_t-\mu_1)^2]\,\pi_1 + \frac{1}{\sqrt{2\pi}}\exp[-\frac{1}{2}(x_t-\mu_2)^2]\,\pi_2} \\
&= \frac{\exp[-\frac{1}{2}(x_t-\mu_1)^2]}{\exp[-\frac{1}{2}(x_t-\mu_1)^2] + \exp[-\frac{1}{2}(x_t-\mu_2)^2]} \\
&= \frac{1}{1 + \exp[-\frac{1}{2}(x_t-\mu_2)^2]\,/\,\exp[-\frac{1}{2}(x_t-\mu_1)^2]} \\
&= \frac{1}{1 + \exp[-\frac{1}{2}[(x_t-\mu_2)^2 - (x_t-\mu_1)^2]]} \\
&= \frac{1}{1 + \exp[-\frac{1}{2}[\text{distance from } \mu_2 - \text{distance from } \mu_1]]}
\end{align*}
Let $\vec{w}^T\vec{x} = \frac{1}{2}[(x_t - \mu_2)^2 - (x_t - \mu_1)^2]$ (the quadratic terms cancel when expanded, so this is linear in $x_t$)
The logistic function gives the posterior probability that the input $x_t$ is from one of the two components
- If $\vec{w}^T\vec{x} = 0$ ($y = 0.5$), $x_t$ is equally likely to belong to either component
- If $\vec{w}^T\vec{x} > 0$ ($y > 0.5$), $x_t$ is more likely to belong to component 1
- If $\vec{w}^T\vec{x} < 0$ ($y < 0.5$), $x_t$ is more likely to belong to component 2
⇒ Logistic units are binary feature detectors
- Relation to the Bernoulli distribution ($y$ = probability of some binary event)
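This correspondence is easy to check numerically. A minimal sketch (the means $\mu_1 = -1$, $\mu_2 = 2$ are illustrative; unit variances and equal priors match the assumptions above): the posterior from Bayes' rule coincides with a logistic applied to the derived linear function.

```python
import numpy as np

mu1, mu2 = -1.0, 2.0   # illustrative means; variances = 1, priors = 0.5

def gauss(x, mu):
    """Unit-variance normal density."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

def posterior_bayes(x):
    """p(component 1 | x) by direct application of Bayes' rule."""
    p1, p2 = gauss(x, mu1) * 0.5, gauss(x, mu2) * 0.5
    return p1 / (p1 + p2)

def posterior_logistic(x):
    """The same posterior as a logistic of w^T x."""
    wx = 0.5 * ((x - mu2) ** 2 - (x - mu1) ** 2)   # linear in x after expansion
    return 1.0 / (1.0 + np.exp(-wx))

x = np.linspace(-5.0, 5.0, 101)
assert np.allclose(posterior_bayes(x), posterior_logistic(x))
```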
(3) Softmax activation function
\[
y_i = \frac{\exp[\vec{w}_i^T\vec{x}]}{\sum_j \exp[\vec{w}_j^T\vec{x}]}
\]
Because $y_i \in (0, 1)$ and $\sum_i y_i = 1$, the $\{y_i\}$ form a multinomial probability distribution
Relation to “soft” winner-take-all
Relation to logistic:
\[
\frac{\exp[\vec{w}_1^T\vec{x}]}{\exp[\vec{w}_1^T\vec{x}] + \exp[\vec{w}_2^T\vec{x}]} = \frac{1}{1 + \exp[\vec{w}_2^T\vec{x} - \vec{w}_1^T\vec{x}]}
\]
Two classes → use logistic
More than two classes → use softmax
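The two-class reduction can be verified directly; a small sketch (the weights and input are arbitrary illustrative values):

```python
import numpy as np

def softmax(a):
    a = a - a.max()            # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

w1, w2 = np.array([0.5, -1.0]), np.array([-0.3, 0.8])  # illustrative weights
x = np.array([1.0, 2.0])                               # illustrative input

y = softmax(np.array([w1 @ x, w2 @ x]))
# Two-class softmax equals a logistic of the difference in net inputs
assert np.isclose(y[0], logistic(w1 @ x - w2 @ x))
```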
Multilayer Neural Networks
Hidden units: use logistic activation function
Output units: use linear, logistic, or softmax function depending on the nature of the task
Q: Why are hidden units nonlinear?
A: A linear function composed with another linear function results in a linear function
\[
y = d(a_1 x_1 + b_1 x_2 + c_1) + e(a_2 x_1 + b_2 x_2 + c_2) + f
\]
\[
= d a_1 x_1 + d b_1 x_2 + d c_1 + e a_2 x_1 + e b_2 x_2 + e c_2 + f
\]
Let $g = d a_1 + e a_2$, $h = d b_1 + e b_2$, and $k = d c_1 + e c_2 + f$. Then
\[
y = g x_1 + h x_2 + k
\]
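The same collapse holds in matrix form for linear layers of any size; a quick numerical check (layer sizes and random weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers: hidden = W1 x + b1, y = W2 hidden + b2
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

# The composition collapses to a single linear map y = W x + b
W, b = W2 @ W1, W2 @ b1 + b2

x = rng.normal(size=2)
assert np.allclose(W2 @ (W1 @ x + b1) + b2, W @ x + b)
```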
Exception: Neural Networks and PCA
Auto-associator (a network trained to reproduce its input at its output): an instance where one might want linear hidden units
[Diagram: auto-associator with 65,536 input units, 7 hidden units, and 65,536 output units]
Relation between neural networks and PCA: After training, the space spanned by the 7 hidden units' weight vectors is the same as the space spanned by the 7 eigenvectors with the largest eigenvalues of the data covariance matrix
Q: Is this supervised or unsupervised training?
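A small simulation of this claim, shrunk from 65,536 input dimensions to 10 and from 7 hidden units to 3 so it runs quickly (the data, learning rate, and iteration count are all illustrative; with sufficient training the two subspaces typically agree):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy mean-centered data with a decaying spectrum, in a random basis
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
X = (rng.normal(size=(500, 10)) * [5, 4, 3, 2, 1, .5, .4, .3, .2, .1]) @ Q.T
X -= X.mean(axis=0)
k = 3   # hidden units (7 in the slide's example)

# Linear auto-associator x_hat = D(Ex), trained by gradient descent
# on the mean squared reconstruction error
E = rng.normal(scale=0.1, size=(k, 10))    # input-to-hidden weights
D = rng.normal(scale=0.1, size=(10, k))    # hidden-to-output weights
lr = 0.02
for _ in range(20000):
    H = X @ E.T                 # hidden unit activations
    R = H @ D.T - X             # reconstruction errors
    D -= lr * (R.T @ H) / len(X)
    E -= lr * (D.T @ R.T @ X) / len(X)

def projector(A):
    """Orthogonal projector onto the column space of A."""
    Qa, _ = np.linalg.qr(A)
    return Qa @ Qa.T

# Top-k eigenvectors of the data covariance matrix
_, evecs = np.linalg.eigh(X.T @ X / len(X))
V = evecs[:, -k:]

# Compare the spanned subspaces via their projection matrices
print(np.linalg.norm(projector(D) - projector(V)))   # small after training
```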
(Nonlinear) Neural Networks are Universal Approximators
[Diagram: network with one layer of hidden units between the input units and a single output unit]
For any continuous function (on a bounded domain), there exists a neural network with exactly one layer of hidden units (hidden units use the logistic activation function; output unit is linear) that can approximate that function arbitrarily closely
Theorem does not say how many hidden units
Theorem does not say how to find the weights
This is a theorem about representation, not learning
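The representational claim is easy to illustrate numerically. In this sketch the hidden units' input weights and biases are simply drawn at random, and only the hidden-to-output weights are fit, by least squares rather than by any learning rule; the target function, hidden unit count, and weight scales are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: np.sin(3 * x) + 0.5 * x   # illustrative target function
x = np.linspace(-2, 2, 200)

# One layer of logistic hidden units with random input weights and biases
n_hidden = 50
a = rng.normal(scale=3.0, size=n_hidden)
b = rng.normal(scale=3.0, size=n_hidden)
H = 1.0 / (1.0 + np.exp(-(np.outer(x, a) + b)))   # hidden activations (200 x 50)
H = np.column_stack([H, np.ones(len(x))])          # bias for the linear output

# Fit the linear output unit by least squares
w, *_ = np.linalg.lstsq(H, f(x), rcond=None)
print("max |error|:", np.abs(H @ w - f(x)).max())  # shrinks as n_hidden grows
```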
Alternative Method for Training Nonlinear Neural Networks
Step 1: Run clustering algorithm with a large number of components
- Each mixture component is a hidden unit
- Posterior probability of a component (given an input) is the hidden unit's activation
- Provides a (nonlinear) "covering" of the space of inputs
Step 2: Run LMS algorithm to learn weights from hidden units to output unit
⇒ Relation to “kernel based” methods
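A minimal sketch of the two-step recipe on a toy 1-D regression problem. To keep it short, Step 1's fitted mixture is replaced by Gaussian components with centers on a fixed grid (an assumption of this sketch, not part of the recipe above); their normalized, posterior-like responsibilities serve as the hidden unit activations, and Step 2 runs plain LMS updates on the hidden-to-output weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (illustrative)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

# Step 1 stand-in: unit-variance Gaussian "components" on a fixed grid;
# hidden activations are their normalized (posterior-like) responsibilities
centers = np.linspace(-3, 3, 10)
act = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)
act /= act.sum(axis=1, keepdims=True)

# Step 2: LMS (stochastic gradient descent) on the hidden-to-output weights
w = np.zeros(len(centers))
lr = 0.5
for _ in range(100):                      # passes through the data
    for h, t in zip(act, y):
        w += lr * (t - w @ h) * h         # LMS update
print("RMSE:", np.sqrt(np.mean((act @ w - y) ** 2)))
```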
Yet Another Alternative Method for Training Nonlinear Neural Networks
Linear Basis Function Models:
\[
y = \vec{w}^T\vec{\phi}(\vec{x}) = \sum_i w_i \phi_i(\vec{x})
\]
where the $\phi_i(\vec{x})$ are fixed functions known as basis functions.
When the $\phi_i(\vec{x})$ are nonlinear basis functions, the output $y$ is a nonlinear function of the input $\vec{x}$.
Polynomial regression: $\phi_i(x) = x^i$ (assuming a single input $x$)
\[
\phi_0(x) = 1, \quad \phi_1(x) = x, \quad \phi_2(x) = x^2, \quad \phi_3(x) = x^3
\]
\[
y = \sum_i w_i \phi_i(x)
\]
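A short sketch of polynomial basis-function regression (the target function, noise level, and degree are illustrative): the basis expansion is computed explicitly and the weights are found by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an illustrative target function
x = rng.uniform(-1, 1, size=100)
y = np.cos(2 * x) + 0.05 * rng.normal(size=100)

# Design matrix with polynomial basis functions phi_i(x) = x**i
degree = 3
Phi = np.column_stack([x ** i for i in range(degree + 1)])

# Fit y = sum_i w_i phi_i(x) by least squares
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("fitted weights:", w)
```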
Gaussian basis functions:
\[
\phi_i(x) = \exp\left\{ -\frac{(x - \mu_i)^2}{2s^2} \right\}
\]
Sigmoidal basis functions:
\[
\phi_i(x) = \sigma\!\left( \frac{x - \mu_i}{s} \right), \qquad \sigma(a) = \frac{1}{1 + \exp(-a)}
\]
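Both families are one-liners in code; a sketch (the centers and width are arbitrary) that builds a design matrix whose columns are basis functions evaluated on a grid:

```python
import numpy as np

def gaussian_basis(x, mu, s):
    """Gaussian basis: phi(x) = exp(-(x - mu)^2 / (2 s^2))."""
    return np.exp(-((x - mu) ** 2) / (2 * s ** 2))

def sigmoidal_basis(x, mu, s):
    """Sigmoidal basis: phi(x) = logistic((x - mu) / s)."""
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

# Evaluate a small bank of basis functions (illustrative centers and width)
x = np.linspace(-1, 1, 5)
mus, s = np.array([-0.5, 0.0, 0.5]), 0.2
Phi_gauss = gaussian_basis(x[:, None], mus[None, :], s)   # 5 x 3 design matrix
Phi_sig = sigmoidal_basis(x[:, None], mus[None, :], s)
```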
Rueckl, J. G., Cave, K. R., & Kosslyn, S. M. (1989). Why are "what" and "where" processed by separate cortical visual systems? A computational investigation. Journal of Cognitive Neuroscience, 1, 171-186.
In the primate visual system:
- object identification
- spatial localization
are performed by different cortical pathways.
Compared two systems:
- One network for both tasks
- Two networks, one for each task
Results: the two-network system is better
- learns faster
- representation is more interpretable
Sejnowski, T. J. & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
S → NP VP "."
NP → PropN | N | N RC
VP → V (NP)
RC → who NP VP | who VP (NP)
N → boy | girl | cat | boys | girls | cats
PropN → John | Mary
V → chase | feed | see | chases | feeds | sees
Language has a componential structure
Learning language: Determine the componential structure from the sentences of the language
Rules coordinate dependencies among words:
- John chases cats.
- Boys chase cats.
⇒ Requires memory
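A tiny generator for the grammar above makes the point concrete (a sketch). Note that this naive version expands each category independently, so it does not enforce number agreement between, say, "boys" and "chase"; coordinating exactly such dependencies across a sentence is what requires memory, and is what Elman's recurrent network had to learn.

```python
import random

# Productions of the grammar above; "(NP)" optionality is expanded explicitly
GRAMMAR = {
    "S":     [["NP", "VP", "."]],
    "NP":    [["PropN"], ["N"], ["N", "RC"]],
    "VP":    [["V"], ["V", "NP"]],
    "RC":    [["who", "NP", "VP"], ["who", "VP"], ["who", "VP", "NP"]],
    "N":     [["boy"], ["girl"], ["cat"], ["boys"], ["girls"], ["cats"]],
    "PropN": [["John"], ["Mary"]],
    "V":     [["chase"], ["feed"], ["see"], ["chases"], ["feeds"], ["sees"]],
}

def expand(symbol):
    """Recursively expand a symbol into a list of words."""
    if symbol not in GRAMMAR:
        return [symbol]                       # terminal
    production = random.choice(GRAMMAR[symbol])
    return [word for s in production for word in expand(s)]

random.seed(0)
for _ in range(3):
    print(" ".join(expand("S")))
```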
Zipser, D. & Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679-684.
Hinton, G. E., Plaut, D. C. & Shallice, T. (1993). Simulating brain damage. Scientific American, 269, 76-82.
Deep dyslexia:
(1) Visual errors:
- stock → "shock"
- crowd → "crown"
(2) Semantic errors:
- symphony → "orchestra"
- uncle → "nephew"
(3) Visual and semantic errors:
- sympathy → "orchestra"
- cat → "bed"