CS440/ECE448 Lecture 8: Perceptron
Mark Hasegawa-Johnson, 2/2021
License: CC-BY 4.0; redistribute at will, as long as you cite the source.
Outline
• A little history: perceptron as a model of a biological neuron
• The perceptron learning algorithm
• Linear separability, yx, and convergence
  • It converges to the right answer, even with η = 1, if data are linearly separable
  • If data are not linearly separable, it's necessary to use η = 1/n.
• Multi-class perceptron
The Giant Squid Axon
• 1909: Williams describes the giant squid axon (1mm thick)
• 1939: Young describes the synapse.
• 1952: Hodgkin & Huxley publish an electrical current model for the generation of binary action potentials from real-valued inputs.
Image released to the public domain by lkkisan, 2007. Modified from Llinás, Rodolfo R. (1999). The squid Giant Synapse.
Perceptron
• 1959: Rosenblatt is granted a patent for the "perceptron," an electrical circuit model of a neuron.
Perceptron
[Diagram: inputs 1, x1, x2, …, xD; weights b, w1, w2, …, wD; output sgn(wᵀx)]
Can incorporate the bias as a component of the weight vector by always including a feature with value set to 1
Perceptron model: action potential = signum(affine function of the features)
ŷ = sgn(w1x1 + … + wDxD + b) = sgn(wᵀx)

where w = [w1, …, wD, b]ᵀ, x = [x1, …, xD, 1]ᵀ, and

sgn(x) = { 1 if x ≥ 0; −1 if x < 0 }
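The model above is easy to sketch in code. This is a minimal illustration, not part of the lecture; the helper names `sgn` and `perceptron_classify` are made up for the example.

```python
import numpy as np

def sgn(a):
    # sgn as defined on the slide: +1 for a >= 0, -1 for a < 0
    return 1 if a >= 0 else -1

def perceptron_classify(w, x):
    # w = [w1, ..., wD, b], x = [x1, ..., xD, 1] (bias folded into w)
    return sgn(np.dot(w, x))

# Example with wT = [1, 0, -20]: predicts +1 iff x1 >= 20
w = np.array([1.0, 0.0, -20.0])
print(perceptron_classify(w, np.array([40.0, 10.0, 1.0])))  # x1 = 40 -> +1
```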
Perceptron
Rosenblatt's big innovation: the perceptron learns from examples.
• Initialize weights randomly
• Cycle through training examples in multiple passes (epochs)
• For each training example:
  • If classified correctly, do nothing
  • If classified incorrectly, update weights

By Elizabeth Goodspeed - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=40188333
Perceptron
For each training instance x with ground truth label y ∈ {−1, 1}:
• Classify with current weights: ŷ = sgn(wᵀx)
• Update weights:
  • If y = ŷ then do nothing
  • If y ≠ ŷ then w = w + ηyx
• η (eta) is a "learning rate." For now, let's assume η = 1.
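This training loop can be written in a few lines. A sketch, assuming a fixed number of epochs and zero initialization for reproducibility (the slides say to initialize randomly); `train_perceptron` is a hypothetical helper name.

```python
import numpy as np

def sgn(a):
    return 1 if a >= 0 else -1

def train_perceptron(X, y, epochs=10, eta=1.0):
    # Each row of X ends with a constant 1, so the bias lives inside w.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if sgn(np.dot(w, xi)) != yi:   # misclassified:
                w = w + eta * yi * xi      #   w = w + eta * y * x
    return w

# Toy 1-D data (last column is the constant-1 feature)
X = np.array([[2.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
print(w)  # a separating weight vector for this toy set
```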
Perceptron example: dogs versus cats
Can you write a program that can tell which ones are dogs, and which ones are cats?
x1 = # times the animal comes when called (out of 40).
x2 = weight of the animal, in pounds.
x = [x1, x2, 1]ᵀ.
y = 1 means "dog"; y = −1 means "cat"

ŷ = sgn(wᵀx)
Perceptron training example: dogs vs. cats
• Let's start with the rule "if it comes when called (by at least 20 different people out of 40), it's a dog."
• Write that as an equation: ŷ = sgn(x1 − 20)
• Write that as a vector equation: ŷ = sgn(wᵀx), where wᵀ = [1, 0, −20]

[Plot: data in the (x1, x2) plane; the boundary x1 = 20 separates sgn(wᵀx) = 1 from sgn(wᵀx) = −1]
Perceptron training example: dogs vs. cats
• The Presa Canario gets misclassified as a cat (y = 1, but ŷ = −1) because it only obeys its trainer, and nobody else (x1 = 1).
• Though it rarely comes when called, it is very large (x2 = 100 pounds).
• xᵀ = [x1, x2, 1] = [1, 100, 1].

[Plot: the Presa Canario at (1, 100) falls on the sgn(wᵀx) = −1 side of the boundary wᵀ = [1, 0, −20]]
Perceptron training example: dogs vs. cats
• The Presa Canario gets misclassified. xᵀ = [x1, x2, 1] = [1, 100, 1].
• Perceptron learning rule: update the weights as:

w = w + yx = [1, 0, −20]ᵀ + 1 × [1, 100, 1]ᵀ = [2, 100, −19]ᵀ

[Plot: the boundary moves from old w = [1, 0, −20]ᵀ to new w = [2, 100, −19]ᵀ, so the sgn(wᵀx) = 1 region now includes the Presa Canario]
Perceptron training example: dogs vs. cats
• The Maltese is small (x2 = 10 pounds) and very tame (x1 = 40): x = [40, 10, 1]ᵀ.
• It's correctly classified! ŷ = sgn(wᵀx) = sgn(2×40 + 100×10 − 19) = +1,
• so w is unchanged.

[Plot: the Maltese at (40, 10) lies on the sgn(wᵀx) = 1 side of the boundary wᵀ = [2, 100, −19]]
Perceptron training example: dogs vs. cats
• The Maine Coon cat is big (x2 = 20 pounds: x = [0, 20, 1]ᵀ), so it gets misclassified as a dog (true label is y = −1 = "cat," but the classifier thinks ŷ = 1 = "dog").

[Plot: the Maine Coon at (0, 20) lies on the sgn(wᵀx) = 1 side of the boundary wᵀ = [2, 100, −19]]
Perceptron training example: dogs vs. cats
• The Maine Coon cat is big (x2 = 20 pounds: x = [0, 20, 1]ᵀ), so it gets misclassified, so we update w:

w = w + yx = [2, 100, −19]ᵀ + (−1) × [0, 20, 1]ᵀ = [2, 80, −20]ᵀ

[Plot: the boundary after the update, wᵀ = [2, 80, −20]]
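The arithmetic of the three steps above (misclassified dog, correctly classified dog, misclassified cat) can be checked numerically. A small sketch, with feature values taken from the slides:

```python
import numpy as np

sgn = lambda a: 1 if a >= 0 else -1
w = np.array([1.0, 0.0, -20.0])   # initial rule: "comes when called at least 20 times"

# Presa Canario: y = 1 (dog), x = [1, 100, 1] -> misclassified, so w = w + y*x
x, y = np.array([1.0, 100.0, 1.0]), 1
assert sgn(w @ x) != y
w = w + y * x                     # becomes [2, 100, -19]

# Maltese: y = 1, x = [40, 10, 1] -> correctly classified, w unchanged
x, y = np.array([40.0, 10.0, 1.0]), 1
assert sgn(w @ x) == y

# Maine Coon: y = -1 (cat), x = [0, 20, 1] -> misclassified, so w = w + y*x
x, y = np.array([0.0, 20.0, 1.0]), -1
assert sgn(w @ x) != y
w = w + y * x                     # becomes [2, 80, -20]
print(w)
```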
Does the perceptron find an answer? Does it find the correct answer?
Suppose we run the perceptron algorithm for a very long time. Will it converge to an answer? Will it converge to the correct answer? That depends on whether or not the data are linearly separable.
"Linearly separable" means it's possible to find a line (a hyperplane, in D-dimensional space) that separates the two classes, like this:

[Plot: two classes of points in the (x1, x2) plane, separated by a line]
Does the perceptron find an answer? Does it find the correct answer?
Suppose that, instead of plotting x, we plot yx.
• If y = 1, plot x
• If y = −1, plot −x

[Plot: the signed points yx in the (x1, x2) plane]
Does the perceptron find an answer? Does it find the correct answer?
Notice that the original data (x) are linearly separable if and only if the signed data (yx) are all in the same half-plane. That means that there is some vector w such that sgn(wᵀ(yx)) = 1 for all of the data.

[Plot: all of the signed points yx in one half-plane, with a vector w pointing into that half-plane]
Does the perceptron find an answer? Does it find the correct answer?
Suppose we start out with the wrong w, so that one token is misclassified. Then we update w as

w = w + yx

…and the boundary moves so that the misclassified token is on the right side.

[Plot: the misclassified token yx, the old w, and the new w = w + yx, whose boundary now puts the token on the correct side]
What if the data are not linearly separable?
… well, then in that case, the perceptron algorithm with η = 1 never converges. The only solution is to use a learning rate, η, that gradually decays over time, so that the update ηyx also gradually decays toward zero.

[Plot: signed points yx that do not all fit in one half-plane]
What about non-separable data?
• If the data are NOT linearly separable, then the perceptron with η = 1 doesn't converge.
• In fact, that's what η is for.
• Remember that w = w + ηyx.
• We can force the perceptron to stop wiggling around by forcing η (and therefore ηyx) to get gradually smaller and smaller.
• This works: for the nᵗʰ training token, set η = 1/n.
• Notice: ∑ₙ₌₁^∞ 1/n is infinite. Nevertheless, η = 1/n works, because the yx tokens are not all in the same direction.
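The decaying-learning-rate variant differs from the earlier loop in only one line. A sketch, under the assumption that n counts every token presented (not just the misclassified ones) and that we simply stop after a fixed number of epochs; `train_perceptron_decay` is an invented name.

```python
import numpy as np

def sgn(a):
    return 1 if a >= 0 else -1

def train_perceptron_decay(X, y, epochs=50):
    # Perceptron with eta = 1/n, where n is the index of the training token.
    w = np.zeros(X.shape[1])
    n = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            n += 1
            if sgn(w @ xi) != yi:
                w = w + (1.0 / n) * yi * xi   # update shrinks as n grows
    return w
```

On separable data this still finds a separator; on non-separable data the shrinking steps stop the weights from wiggling forever.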
Multi-Class Perceptron
[Diagram: inputs 1, x1, x2, …, xD; weight vectors w0, …, w_{V−1}; output argmax over c = 0, …, V−1 of w_cᵀx]

True class is y ∈ {0, 1, 2, …, V−1} (i.e., V = vocabulary size = # of distinct classes).
Classifier output is

ŷ = argmax_{c=0,…,V−1} (w_{c,1}x1 + … + w_{c,D}xD + b_c)
  = argmax_{c=0,…,V−1} w_cᵀx
  ∈ {0, 1, …, V−1}

where w_c = [w_{c,1}, …, w_{c,D}, b_c]ᵀ and x = [x1, …, xD, 1]ᵀ
Multi-Class Perceptron
True class is y ∈ {0, 1, 2, …, V−1} (i.e., V = vocabulary size = # of distinct classes).
Classifier output is

ŷ = argmax_{c=0,…,V−1} (w_{c,1}x1 + … + w_{c,D}xD + b_c)
  = argmax_{c=0,…,V−1} w_cᵀx
  ∈ {0, 1, …, V−1}

where w_c = [w_{c,1}, …, w_{c,D}, b_c]ᵀ and x = [x1, …, xD, 1]ᵀ

[Plot: the (x1, x2) plane partitioned into regions ŷ = 0 through ŷ = 19. By Balu Ertl - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=38534275]
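The argmax classifier can be written as one matrix-vector product. A minimal sketch, assuming the class weight vectors are stacked as the rows of a hypothetical V×(D+1) matrix `W`:

```python
import numpy as np

def multiclass_classify(W, x):
    # W: (V, D+1) matrix; row c is w_c = [w_c1, ..., w_cD, b_c]
    # x: (D+1,) vector [x1, ..., xD, 1]
    return int(np.argmax(W @ x))   # yhat in {0, ..., V-1}

# Toy example with V = 3 classes and D = 2 features
W = np.array([[ 1.0,  0.0, 0.0],
              [ 0.0,  1.0, 0.0],
              [-1.0, -1.0, 0.5]])
print(multiclass_classify(W, np.array([2.0, 0.0, 1.0])))  # scores [2, 0, -1.5] -> class 0
```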
Multi-Class Linear Classifiers
All multi-class linear classifiers have the form

ŷ = argmax_{c=0,…,V−1} w_cᵀx

The region of x-space associated with each class label is convex with piece-wise linear boundaries. Such regions are called "Voronoi regions."

[Plot: the (x1, x2) plane partitioned into convex regions ŷ = 0 through ŷ = 19. By Balu Ertl - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=38534275]
Training a Multi-Class Perceptron
For each training instance x with ground truth label y ∈ {0, 1, …, V−1}:
• Classify with current weights: ŷ = argmax_{c=0,…,V−1} w_cᵀx
• Update weights:
  • If ŷ is correct (y = ŷ) then do nothing
  • If ŷ is incorrect (y ≠ ŷ) then:
    • Update the correct-class vector as w_y = w_y + ηx
    • Update the wrong-class vector as w_ŷ = w_ŷ − ηx
    • Don't change the vectors of any other class
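The multi-class update rule can be sketched the same way as the binary loop: on a mistake, add x to the correct class's row and subtract it from the predicted class's row. A sketch with zero initialization and a fixed epoch count; `train_multiclass_perceptron` is an invented helper name.

```python
import numpy as np

def train_multiclass_perceptron(X, y, V, epochs=10, eta=1.0):
    # W: one weight row per class; each row of X ends with a constant 1 (bias).
    W = np.zeros((V, X.shape[1]))
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            yhat = int(np.argmax(W @ xi))
            if yhat != yi:                    # misclassified:
                W[yi] = W[yi] + eta * xi      #   boost the correct class
                W[yhat] = W[yhat] - eta * xi  #   penalize the predicted class
    return W

# Toy data: three classes at three corners of the (x1, x2) plane
X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [-1.0, -1.0, 1.0]])
y = np.array([0, 1, 2])
W = train_multiclass_perceptron(X, y, V=3)
```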
Conclusions
• Perceptron as a model of a biological neuron: ŷ = sgn(wᵀx)
• The perceptron learning algorithm: if y = ŷ then do nothing, else w = w + ηyx.
• Linear separability, yx, and convergence
  • It converges to the right answer, even with η = 1, if data are linearly separable
  • If data are not linearly separable, it's necessary to use η = 1/n.
• Multi-class perceptron: if y = ŷ then do nothing, else w_y = w_y + ηx and w_ŷ = w_ŷ − ηx.