CS 461: Machine Learning
Lecture 4

Dr. Kiri Wagstaff
[email protected]
Plan for Today
Solution to HW 2
Support Vector Machines
  Review of Classification
  Non-separable data: the Kernel Trick
  Regression
Neural Networks
  Perceptrons
  Multilayer Perceptrons
  Backpropagation
Review from Lecture 3
Decision trees
  Regression trees, pruning
Evaluation
  One classifier: errors, confidence intervals, significance; baselines
  Comparing two classifiers: McNemar’s test
  Cross-validation
Support Vector Machines
  Classification: linear discriminants, maximum margin
  Learning (optimization): gradient descent, QP
  Non-separable classes
Support Vector Machines
Chapter 10
Support Vector Machines
Maximum-margin linear classifiers
Support vectors
How to find best w, b?
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y^t\left(w^T x^t + b\right) \ge +1,\ \forall t$$

$$g(x \mid w, w_0) = w^T x + w_0 = \sum_{j=1}^{d} w_j x_j + w_0$$
Optimization in action: SVM applet
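For readers who want to try this outside the applet, here is a minimal sketch (an illustration, not course material) using scikit-learn's SVC, which solves this QP; a large C approximates the hard-margin problem:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two classes in 2-D
X = np.array([[1, 1], [2, 2], [2, 0],
              [0, 0], [1, 0], [0, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

# Large C approximates the hard-margin objective:
# min (1/2)||w||^2 subject to y_t (w^T x_t + b) >= 1 for all t
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("support vectors:", clf.support_vectors_)      # points on the margin
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))
```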
What if Data isn’t Linearly Separable?
1. Add “slack” variables to permit some errors [Andrew Moore’s slides]
2. Embed data in a higher-dimensional space
   Explicit: basis functions (new features)
   Implicit: kernel functions (new dot product)
   Still need to find a linear hyperplane
Build a Better Dot Product: Basis Functions
Preprocess input x by basis functions: z = φ(x), so g(z) = w^T z becomes g(x) = w^T φ(x)
$$g(x) = w^T \varphi(x) + b = \sum_t \alpha^t y^t \varphi\left(x^t\right)^T \varphi(x) + b = \sum_t \alpha^t y^t K\left(x^t, x\right) + b$$
[Alpaydin 2004 The MIT Press]
Build a Better Dot Product: Basis Functions -> Kernel Functions
$$g(x) = w^T \varphi(x) + b = \sum_t \alpha^t y^t \varphi\left(x^t\right)^T \varphi(x) + b = \sum_t \alpha^t y^t K\left(x^t, x\right) + b$$

Linear: $K(x^t, x) = x^T x^t$

Polynomial (degree 2): $K(x^t, x) = \left(x^T x^t + 1\right)^2$
  Expanding in 2-D with $x = (x_1, x_2)$ and $y = (y_1, y_2)$:
  $$K(x, y) = \left(x_1 y_1 + x_2 y_2 + 1\right)^2 = 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
  which equals $\varphi(x)^T \varphi(y)$ for $\varphi(x) = \left[1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_1 x_2,\ x_1^2,\ x_2^2\right]^T$

RBF: $K(x^t, x) = \exp\left[-\dfrac{\left\|x^t - x\right\|^2}{2\sigma^2}\right]$

Sigmoid: $K(x^t, x) = \tanh\left(2x^T x^t + 1\right)$

Optimization in action: SVM applet
[Alpaydin 2004 The MIT Press]
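The kernel trick can be checked numerically. This is a small NumPy sketch (illustrative, not from the slides) verifying that the degree-2 polynomial kernel equals the dot product of the explicit expansion φ above, without ever forming φ inside the kernel:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 basis expansion for a 2-D input."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2,
                     x2 ** 2])

def poly_kernel(x, y):
    """Implicit version: works directly on the 2-D inputs."""
    return (np.dot(x, y) + 1) ** 2

x = np.array([3.0, -1.0])
y = np.array([0.5, 2.0])
print(poly_kernel(x, y))        # 0.25
print(phi(x) @ phi(y))          # 0.25 again, via the 6-D feature space
```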
Example: Orbital Classification
Linear SVM flying onboard the EO-1 Earth Orbiter since Dec. 2004
Classifies every pixel of each Hyperion image
Four classes: snow, land, water, ice
12 features (of 256 collected)
[Figure: raw Hyperion data and SVM-classified map; Castano et al., 2005]
SVM Regression
Use a linear model (possibly kernelized)
f(x) = w^T x + w_0

Use the ε-insensitive error function:

$$e_\varepsilon\left(y^t, f\left(x^t\right)\right) = \begin{cases} 0 & \text{if } \left|y^t - f\left(x^t\right)\right| < \varepsilon \\ \left|y^t - f\left(x^t\right)\right| - \varepsilon & \text{otherwise} \end{cases}$$

$$\min \ \frac{1}{2}\|w\|^2 + C\sum_t \left(\xi_t^{+} + \xi_t^{-}\right)$$
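As a sketch, assuming scikit-learn is available: SVR implements this objective, with epsilon setting the half-width of the error-free tube and C weighting the slack variables:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 40).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.1, size=40)

# Points within epsilon of f(x) contribute zero error (and are not
# support vectors); C trades flatness of w against the slack terms.
reg = SVR(kernel="linear", C=10.0, epsilon=0.2).fit(X, y)
print("w =", reg.coef_, " w0 =", reg.intercept_)
print("fraction of support vectors:", len(reg.support_) / len(X))
```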
Neural Networks
Chapter 11
Perceptron
[Alpaydin 2004 The MIT Press]
Graphical: [Figure: perceptron with inputs x_j, weights w_j, bias unit x0 = +1, output y]

Math:

$$y = \sum_{j=1}^{d} w_j x_j + w_0 = w^T x$$

$$w = \left[w_0, w_1, \ldots, w_d\right]^T, \qquad x = \left[1, x_1, \ldots, x_d\right]^T$$
Perceptrons in action
[Alpaydin 2004 The MIT Press]
Regression: y = wx + w_0
Classification: y = 1(wx + w_0 > 0)
[Figure: perceptron diagrams with bias unit x0 = +1 and sigmoid output s]
“Smooth” Output: Sigmoid Function
$$y = \text{sigmoid}\left(w^T x\right) = \frac{1}{1 + \exp\left[-w^T x\right]}$$

Two equivalent decision rules:
1. Calculate g(x) = w^T x and choose C1 if g(x) > 0, or
2. Calculate y = sigmoid(w^T x) and choose C1 if y > 0.5
Why?
• Converts output to probability!
• Less “brittle” boundary
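A tiny NumPy sketch (illustrative values) showing why the two rules agree: sigmoid is monotonic, so g(x) > 0 exactly when sigmoid(g(x)) > 0.5, and the sigmoid value doubles as an estimate of P(C1 | x):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.1, 0.8, -0.4])   # w0 first (bias), then w1, w2
x = np.array([1.0, 2.0, 1.0])    # input augmented with x0 = +1

g = w @ x                        # rule 1: threshold g(x) at 0
y = sigmoid(g)                   # rule 2: threshold y at 0.5
print(g > 0, y > 0.5)            # always identical decisions
print("P(C1 | x) ~", y)          # graded confidence, not just a label
```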
K Outputs

Regression:

$$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = w_i^T x, \qquad y = Wx$$

Classification (softmax):

$$o_i = w_i^T x, \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k}, \qquad \text{choose } C_i \text{ if } y_i = \max_k y_k$$

[Alpaydin 2004 The MIT Press]
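A minimal NumPy softmax sketch (the max-subtraction is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))   # subtract max for numerical stability
    return e / e.sum()

W = np.array([[0.5, -0.2],      # one weight row w_i per class
              [0.1, 0.9],
              [-0.3, 0.4]])
x = np.array([1.0, 2.0])

o = W @ x                       # o_i = w_i^T x
y = softmax(o)                  # y_i = exp(o_i) / sum_k exp(o_k)
print(y, y.sum())               # probabilities summing to 1
print("choose class", np.argmax(y))
```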
Training a Neural Network
1. Randomly initialize weights
2. Update = learning rate * (desired - actual) * input

$$\Delta w_j^t = \eta\left(y^t - \hat{y}^t\right)x_j^t$$
Learning Boolean AND
[Alpaydin 2004 The MIT Press]
$$\Delta w_j^t = \eta\left(y^t - \hat{y}^t\right)x_j^t$$
Perceptron demo
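For readers without the demo, here is a sketch of the update rule applied to Boolean AND, assuming NumPy and 0/1 targets with a step activation; AND is linearly separable, so the rule converges:

```python
import numpy as np

# Inputs augmented with a bias term x0 = +1; targets are AND(x1, x2)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(3)
eta = 0.5
for epoch in range(10):
    for xt, yt in zip(X, y):
        y_hat = float(np.dot(w, xt) > 0)   # step activation
        w += eta * (yt - y_hat) * xt       # delta w = eta (y - y_hat) x

print("w =", w)
print("outputs:", (X @ w > 0).astype(int))   # should be [0, 0, 0, 1]
```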
Multilayer Perceptrons = MLP = ANN
[Alpaydin 2004 The MIT Press]
$$y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

$$z_h = \text{sigmoid}\left(w_h^T x\right) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$
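A forward-pass sketch of these equations in NumPy (layer sizes and weights are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H, K = 3, 4, 2                        # inputs, hidden units, outputs
rng = np.random.default_rng(0)
W = rng.normal(size=(H, d + 1)) * 0.1    # hidden weights w_h (bias w_h0 first)
V = rng.normal(size=(K, H + 1)) * 0.1    # output weights v_i (bias v_i0 first)

x = np.array([0.5, -1.0, 2.0])
x_aug = np.append(1.0, x)                # prepend x0 = +1 for the bias

z = sigmoid(W @ x_aug)                   # z_h = sigmoid(w_h^T x)
z_aug = np.append(1.0, z)                # prepend z0 = +1 for the bias
y = V @ z_aug                            # y_i = v_i^T z (linear outputs)
print(z, y)
```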
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
[Alpaydin 2004 The MIT Press]
Backpropagation: MLP training
$$y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}, \qquad z_h = \text{sigmoid}\left(w_h^T x\right)$$

Train by gradient descent, using the chain rule to reach the first-layer weights:

$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial z_h}\frac{\partial z_h}{\partial w_{hj}}$$
Backpropagation: Regression

Forward:

$$\hat{y}^t = \sum_{h=1}^{H} v_h z_h^t + v_0, \qquad z_h = \text{sigmoid}\left(w_h^T x\right)$$

Error:

$$E\left(W, v \mid X\right) = \frac{1}{2}\sum_t \left(y^t - \hat{y}^t\right)^2$$

Backward:

$$\Delta v_h = \eta \sum_t \left(y^t - \hat{y}^t\right) z_h^t$$

$$\Delta w_{hj} = -\eta\frac{\partial E}{\partial w_{hj}} = -\eta\sum_t \frac{\partial E}{\partial \hat{y}^t}\frac{\partial \hat{y}^t}{\partial z_h^t}\frac{\partial z_h^t}{\partial w_{hj}} = \eta\sum_t \left(y^t - \hat{y}^t\right) v_h z_h^t\left(1 - z_h^t\right) x_j^t$$
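Putting the forward and backward passes together, a NumPy sketch of exactly these batch updates on a toy regression problem (the learning rate, epoch count, and sine target are illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N, d, H, eta = 50, 1, 8, 0.01
X = np.linspace(-3, 3, N).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(scale=0.05, size=N)

Xa = np.hstack([np.ones((N, 1)), X])       # augment inputs with x0 = +1
W = rng.normal(size=(H, d + 1)) * 0.5      # hidden weights w_h (bias first)
v = rng.normal(size=H + 1) * 0.5           # output weights (v0 first)

for epoch in range(2000):
    # Forward pass
    Z = sigmoid(Xa @ W.T)                  # z_h^t for all t, shape (N, H)
    Za = np.hstack([np.ones((N, 1)), Z])   # augment with z0 = +1
    y_hat = Za @ v                         # linear output for regression
    err = y - y_hat                        # (y^t - y_hat^t)
    # Backward pass: the two update rules from the slide
    dv = eta * Za.T @ err                  # delta v_h
    dW = eta * (err[:, None] * v[1:] * Z * (1 - Z)).T @ Xa   # delta w_hj
    v += dv
    W += dW

Z = sigmoid(Xa @ W.T)
y_hat = np.hstack([np.ones((N, 1)), Z]) @ v
print("final MSE:", np.mean((y - y_hat) ** 2))
```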
Backpropagation: Classification

Forward:

$$z_h = \text{sigmoid}\left(w_h^T x\right), \qquad \hat{y}^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$

Error (cross-entropy):

$$E\left(W, v \mid X\right) = -\sum_t \left[y^t \log \hat{y}^t + \left(1 - y^t\right)\log\left(1 - \hat{y}^t\right)\right]$$

Backward (same form as regression):

$$\Delta v_h = \eta\sum_t \left(y^t - \hat{y}^t\right) z_h^t, \qquad \Delta w_{hj} = \eta\sum_t \left(y^t - \hat{y}^t\right) v_h z_h^t\left(1 - z_h^t\right) x_j^t$$
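Only the output layer differs from the regression loop above, yet the updates keep the same form: the cross-entropy gradient cancels the output sigmoid's derivative. A sketch on XOR, the problem from the earlier slide that a single perceptron cannot solve (seed and hidden-layer size are illustrative; gradient descent can stall in a local minimum, as noted on the ANN vs. SVM slide):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# XOR inputs augmented with x0 = +1; a single perceptron cannot solve this
Xa = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(1)
H, eta = 4, 0.5
W = rng.normal(size=(H, 3))            # hidden weights (bias first)
v = rng.normal(size=H + 1)             # output weights (v0 first)

for epoch in range(10000):
    Z = sigmoid(Xa @ W.T)
    Za = np.hstack([np.ones((4, 1)), Z])
    y_hat = sigmoid(Za @ v)            # sigmoid output, not linear
    err = y - y_hat                    # cross-entropy cancels the output
                                       # sigmoid's derivative, leaving the
                                       # same (y - y_hat) term as regression
    dv = eta * Za.T @ err
    dW = eta * (err[:, None] * v[1:] * Z * (1 - Z)).T @ Xa
    v += dv
    W += dW

print(np.round(y_hat, 2))              # should approach [0, 1, 1, 0];
                                       # another seed may be needed if
                                       # training stalls in a local minimum
```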
Backpropagation Algorithm
Examples
Digit Recognition
Ball Balancing
ANN vs. SVM
SVM with sigmoid kernel = 2-layer MLP
Parameters
  ANN: # hidden layers, # nodes
  SVM: kernel, kernel params, C
Optimization
  ANN: local minimum (gradient descent)
  SVM: global minimum (QP)
Interpretability? About the same…
So why SVMs? Sparse solution, geometric interpretation, less likely to overfit data
Summary: Key Points for Today
Support Vector Machines
  Non-separable data: basis functions, the Kernel Trick
  Regression
Neural Networks
  Perceptrons: sigmoid, training by gradient descent
  Multilayer Perceptrons: Backpropagation (gradient descent)
ANN vs. SVM
Review for Midterm
Next Time
Midterm Exam!
  9:00 - 10:30 a.m.
  Open book, open notes (no computer)
  Covers all material through today
Bayesian Methods (read Ch. 3.1, 3.2, 3.7, 3.9)
  If you’re rusty on probability, read Appendix A
Questions to answer from the reading
  Posted on the website (calendar)