CS 461: Machine Learning
Lecture 4

Dr. Kiri Wagstaff
[email protected]
Plan for Today
Solution to HW 2
Support Vector Machines
  Review of Classification
  Non-separable data: the Kernel Trick
  Regression
Neural Networks
  Perceptrons
  Multilayer Perceptrons
  Backpropagation
Review from Lecture 3
Decision trees
  Regression trees, pruning
Evaluation
  One classifier: errors, confidence intervals, significance; baselines
  Comparing two classifiers: McNemar’s test
  Cross-validation
Support Vector Machines
  Classification: linear discriminants, maximum margin
  Learning (optimization): gradient descent, QP
  Non-separable classes
Support Vector Machines
Chapter 10
Support Vector Machines
Maximum-margin linear classifiers
Support vectors
How to find best w, b?
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y^t\left(w^T x^t + b\right) \ge +1,\ \forall t$$

$$g(x \mid w, w_0) = w^T x + w_0 = \sum_{j=1}^{d} w_j x_j + w_0$$
Optimization in action: SVM applet
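For readers who want to try this outside the applet, here is a minimal sketch (an illustration, not course material) using scikit-learn's SVC, which solves this QP; a large C approximates the hard-margin problem:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two classes in 2-D
X = np.array([[1, 1], [2, 2], [2, 0],
              [0, 0], [1, 0], [0, 1]], dtype=float)
y = np.array([+1, +1, +1, -1, -1, -1])

# Large C approximates the hard-margin objective:
# min (1/2)||w||^2 subject to y_t (w^T x_t + b) >= 1 for all t
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("support vectors:", clf.support_vectors_)      # points on the margin
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))
```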
What if Data isn’t Linearly Separable?
1. Add “slack” variables to permit some errors [Andrew Moore’s slides]
2. Embed data in a higher-dimensional space
   Explicit: basis functions (new features)
   Implicit: kernel functions (new dot product)
   Still need to find a linear hyperplane
Build a Better Dot Product: Basis Functions
Preprocess input x by basis functions: z = φ(x), so g(z) = w^T z becomes g(x) = w^T φ(x)
$$g(x) = w^T \varphi(x) + b = \sum_t \alpha^t y^t \varphi\left(x^t\right)^T \varphi(x) + b = \sum_t \alpha^t y^t K\left(x^t, x\right) + b$$
[Alpaydin 2004 The MIT Press]
Build a Better Dot Product: Basis Functions -> Kernel Functions
$$g(x) = w^T \varphi(x) + b = \sum_t \alpha^t y^t \varphi\left(x^t\right)^T \varphi(x) + b = \sum_t \alpha^t y^t K\left(x^t, x\right) + b$$

Linear: $K(x^t, x) = x^T x^t$

Polynomial (degree 2): $K(x^t, x) = \left(x^T x^t + 1\right)^2$
  Expanding in 2-D with $x = (x_1, x_2)$ and $y = (y_1, y_2)$:
  $$K(x, y) = \left(x_1 y_1 + x_2 y_2 + 1\right)^2 = 1 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
  which equals $\varphi(x)^T \varphi(y)$ for $\varphi(x) = \left[1,\ \sqrt{2}x_1,\ \sqrt{2}x_2,\ \sqrt{2}x_1 x_2,\ x_1^2,\ x_2^2\right]^T$

RBF: $K(x^t, x) = \exp\left[-\dfrac{\left\|x^t - x\right\|^2}{2\sigma^2}\right]$

Sigmoid: $K(x^t, x) = \tanh\left(2x^T x^t + 1\right)$

Optimization in action: SVM applet
[Alpaydin 2004 The MIT Press]
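The kernel trick can be checked numerically. This is a small NumPy sketch (illustrative, not from the slides) verifying that the degree-2 polynomial kernel equals the dot product of the explicit expansion φ above, without ever forming φ inside the kernel:

```python
import numpy as np

def phi(v):
    """Explicit degree-2 basis expansion for a 2-D input."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2,
                     x2 ** 2])

def poly_kernel(x, y):
    """Implicit version: works directly on the 2-D inputs."""
    return (np.dot(x, y) + 1) ** 2

x = np.array([3.0, -1.0])
y = np.array([0.5, 2.0])
print(poly_kernel(x, y))        # 0.25
print(phi(x) @ phi(y))          # 0.25 again, via the 6-D feature space
```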
Example: Orbital Classification
Linear SVM flying onboard the EO-1 Earth Orbiter since Dec. 2004
Classifies every pixel of each Hyperion image
Four classes: snow, land, water, ice
12 features (of 256 collected)
[Figure: raw Hyperion data and SVM-classified map; Castano et al., 2005]
SVM Regression
Use a linear model (possibly kernelized)
f(x) = w^T x + w_0

Use the ε-insensitive error function:

$$e_\varepsilon\left(y^t, f\left(x^t\right)\right) = \begin{cases} 0 & \text{if } \left|y^t - f\left(x^t\right)\right| < \varepsilon \\ \left|y^t - f\left(x^t\right)\right| - \varepsilon & \text{otherwise} \end{cases}$$

$$\min \ \frac{1}{2}\|w\|^2 + C\sum_t \left(\xi_t^{+} + \xi_t^{-}\right)$$
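As a sketch, assuming scikit-learn is available: SVR implements this objective, with epsilon setting the half-width of the error-free tube and C weighting the slack variables:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 40).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.1, size=40)

# Points within epsilon of f(x) contribute zero error (and are not
# support vectors); C trades flatness of w against the slack terms.
reg = SVR(kernel="linear", C=10.0, epsilon=0.2).fit(X, y)
print("w =", reg.coef_, " w0 =", reg.intercept_)
print("fraction of support vectors:", len(reg.support_) / len(X))
```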
Neural Networks
Chapter 11
Perceptron
[Alpaydin 2004 The MIT Press]
Graphical: [Figure: perceptron with inputs x_j, weights w_j, bias unit x0 = +1, output y]

Math:

$$y = \sum_{j=1}^{d} w_j x_j + w_0 = w^T x$$

$$w = \left[w_0, w_1, \ldots, w_d\right]^T, \qquad x = \left[1, x_1, \ldots, x_d\right]^T$$
Perceptrons in action
[Alpaydin 2004 The MIT Press]
Regression: y = wx + w_0
Classification: y = 1(wx + w_0 > 0)
[Figure: perceptron diagrams with bias unit x0 = +1 and sigmoid output s]
“Smooth” Output: Sigmoid Function
$$y = \text{sigmoid}\left(w^T x\right) = \frac{1}{1 + \exp\left[-w^T x\right]}$$

Two equivalent decision rules:
1. Calculate g(x) = w^T x and choose C1 if g(x) > 0, or
2. Calculate y = sigmoid(w^T x) and choose C1 if y > 0.5
Why?
• Converts output to probability!
• Less “brittle” boundary
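A tiny NumPy sketch (illustrative values) showing why the two rules agree: sigmoid is monotonic, so g(x) > 0 exactly when sigmoid(g(x)) > 0.5, and the sigmoid value doubles as an estimate of P(C1 | x):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.1, 0.8, -0.4])   # w0 first (bias), then w1, w2
x = np.array([1.0, 2.0, 1.0])    # input augmented with x0 = +1

g = w @ x                        # rule 1: threshold g(x) at 0
y = sigmoid(g)                   # rule 2: threshold y at 0.5
print(g > 0, y > 0.5)            # always identical decisions
print("P(C1 | x) ~", y)          # graded confidence, not just a label
```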
K Outputs

Regression:

$$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = w_i^T x, \qquad y = Wx$$

Classification (softmax):

$$o_i = w_i^T x, \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k}, \qquad \text{choose } C_i \text{ if } y_i = \max_k y_k$$

[Alpaydin 2004 The MIT Press]
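A minimal NumPy softmax sketch (the max-subtraction is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))   # subtract max for numerical stability
    return e / e.sum()

W = np.array([[0.5, -0.2],      # one weight row w_i per class
              [0.1, 0.9],
              [-0.3, 0.4]])
x = np.array([1.0, 2.0])

o = W @ x                       # o_i = w_i^T x
y = softmax(o)                  # y_i = exp(o_i) / sum_k exp(o_k)
print(y, y.sum())               # probabilities summing to 1
print("choose class", np.argmax(y))
```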
Training a Neural Network
1. Randomly initialize weights
2. Update = learning rate * (desired - actual) * input

$$\Delta w_j^t = \eta\left(y^t - \hat{y}^t\right)x_j^t$$
Learning Boolean AND
[Alpaydin 2004 The MIT Press]
$$\Delta w_j^t = \eta\left(y^t - \hat{y}^t\right)x_j^t$$
Perceptron demo
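For readers without the demo, here is a sketch of the update rule applied to Boolean AND, assuming NumPy and 0/1 targets with a step activation; AND is linearly separable, so the rule converges:

```python
import numpy as np

# Inputs augmented with a bias term x0 = +1; targets are AND(x1, x2)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

w = np.zeros(3)
eta = 0.5
for epoch in range(10):
    for xt, yt in zip(X, y):
        y_hat = float(np.dot(w, xt) > 0)   # step activation
        w += eta * (yt - y_hat) * xt       # delta w = eta (y - y_hat) x

print("w =", w)
print("outputs:", (X @ w > 0).astype(int))   # should be [0, 0, 0, 1]
```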
Multilayer Perceptrons = MLP = ANN
[Alpaydin 2004 The MIT Press]
$$y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$

$$z_h = \text{sigmoid}\left(w_h^T x\right) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$
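A forward-pass sketch of these equations in NumPy (layer sizes and weights are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H, K = 3, 4, 2                        # inputs, hidden units, outputs
rng = np.random.default_rng(0)
W = rng.normal(size=(H, d + 1)) * 0.1    # hidden weights w_h (bias w_h0 first)
V = rng.normal(size=(K, H + 1)) * 0.1    # output weights v_i (bias v_i0 first)

x = np.array([0.5, -1.0, 2.0])
x_aug = np.append(1.0, x)                # prepend x0 = +1 for the bias

z = sigmoid(W @ x_aug)                   # z_h = sigmoid(w_h^T x)
z_aug = np.append(1.0, z)                # prepend z0 = +1 for the bias
y = V @ z_aug                            # y_i = v_i^T z (linear outputs)
print(z, y)
```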
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
[Alpaydin 2004 The MIT Press]
Backpropagation: MLP training
$$y_i = v_i^T z = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}, \qquad z_h = \text{sigmoid}\left(w_h^T x\right)$$

Train by gradient descent, using the chain rule to reach the first-layer weights:

$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial z_h}\frac{\partial z_h}{\partial w_{hj}}$$
Backpropagation: Regression

Forward:

$$\hat{y}^t = \sum_{h=1}^{H} v_h z_h^t + v_0, \qquad z_h = \text{sigmoid}\left(w_h^T x\right)$$

Error:

$$E\left(W, v \mid X\right) = \frac{1}{2}\sum_t \left(y^t - \hat{y}^t\right)^2$$

Backward:

$$\Delta v_h = \eta \sum_t \left(y^t - \hat{y}^t\right) z_h^t$$

$$\Delta w_{hj} = -\eta\frac{\partial E}{\partial w_{hj}} = -\eta\sum_t \frac{\partial E}{\partial \hat{y}^t}\frac{\partial \hat{y}^t}{\partial z_h^t}\frac{\partial z_h^t}{\partial w_{hj}} = \eta\sum_t \left(y^t - \hat{y}^t\right) v_h z_h^t\left(1 - z_h^t\right) x_j^t$$
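Putting the forward and backward passes together, a NumPy sketch of exactly these batch updates on a toy regression problem (the learning rate, epoch count, and sine target are illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
N, d, H, eta = 50, 1, 8, 0.01
X = np.linspace(-3, 3, N).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(scale=0.05, size=N)

Xa = np.hstack([np.ones((N, 1)), X])       # augment inputs with x0 = +1
W = rng.normal(size=(H, d + 1)) * 0.5      # hidden weights w_h (bias first)
v = rng.normal(size=H + 1) * 0.5           # output weights (v0 first)

for epoch in range(2000):
    # Forward pass
    Z = sigmoid(Xa @ W.T)                  # z_h^t for all t, shape (N, H)
    Za = np.hstack([np.ones((N, 1)), Z])   # augment with z0 = +1
    y_hat = Za @ v                         # linear output for regression
    err = y - y_hat                        # (y^t - y_hat^t)
    # Backward pass: the two update rules from the slide
    dv = eta * Za.T @ err                  # delta v_h
    dW = eta * (err[:, None] * v[1:] * Z * (1 - Z)).T @ Xa   # delta w_hj
    v += dv
    W += dW

Z = sigmoid(Xa @ W.T)
y_hat = np.hstack([np.ones((N, 1)), Z]) @ v
print("final MSE:", np.mean((y - y_hat) ** 2))
```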
Backpropagation: Classification

Forward:

$$z_h = \text{sigmoid}\left(w_h^T x\right), \qquad \hat{y}^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)$$

Error (cross-entropy):

$$E\left(W, v \mid X\right) = -\sum_t \left[y^t \log \hat{y}^t + \left(1 - y^t\right)\log\left(1 - \hat{y}^t\right)\right]$$

Backward (same form as regression):

$$\Delta v_h = \eta\sum_t \left(y^t - \hat{y}^t\right) z_h^t, \qquad \Delta w_{hj} = \eta\sum_t \left(y^t - \hat{y}^t\right) v_h z_h^t\left(1 - z_h^t\right) x_j^t$$
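Only the output layer differs from the regression loop above, yet the updates keep the same form: the cross-entropy gradient cancels the output sigmoid's derivative. A sketch on XOR, the problem from the earlier slide that a single perceptron cannot solve (seed and hidden-layer size are illustrative; gradient descent can stall in a local minimum, as noted on the ANN vs. SVM slide):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# XOR inputs augmented with x0 = +1; a single perceptron cannot solve this
Xa = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(1)
H, eta = 4, 0.5
W = rng.normal(size=(H, 3))            # hidden weights (bias first)
v = rng.normal(size=H + 1)             # output weights (v0 first)

for epoch in range(10000):
    Z = sigmoid(Xa @ W.T)
    Za = np.hstack([np.ones((4, 1)), Z])
    y_hat = sigmoid(Za @ v)            # sigmoid output, not linear
    err = y - y_hat                    # cross-entropy cancels the output
                                       # sigmoid's derivative, leaving the
                                       # same (y - y_hat) term as regression
    dv = eta * Za.T @ err
    dW = eta * (err[:, None] * v[1:] * Z * (1 - Z)).T @ Xa
    v += dv
    W += dW

print(np.round(y_hat, 2))              # should approach [0, 1, 1, 0];
                                       # another seed may be needed if
                                       # training stalls in a local minimum
```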
Backpropagation Algorithm
Examples
Digit Recognition
Ball Balancing
ANN vs. SVM
SVM with sigmoid kernel = 2-layer MLP
Parameters
  ANN: # hidden layers, # nodes
  SVM: kernel, kernel params, C
Optimization
  ANN: local minimum (gradient descent)
  SVM: global minimum (QP)
Interpretability? About the same…
So why SVMs? Sparse solution, geometric interpretation, less likely to overfit data
Summary: Key Points for Today
Support Vector Machines
  Non-separable data: basis functions, the Kernel Trick
  Regression
Neural Networks
  Perceptrons: sigmoid, training by gradient descent
  Multilayer Perceptrons: Backpropagation (gradient descent)
ANN vs. SVM
Review for Midterm
Next Time
Midterm Exam!
  9:00 - 10:30 a.m.
  Open book, open notes (no computer)
  Covers all material through today
Bayesian Methods (read Ch. 3.1, 3.2, 3.7, 3.9)
  If you’re rusty on probability, read Appendix A
Questions to answer from the reading
  Posted on the website (calendar)