
CS 461: Machine Learning Lecture 4


Page 1: CS 461: Machine Learning Lecture 4

1/26/08 CS 461, Winter 2008 1

CS 461: Machine Learning, Lecture 4

Dr. Kiri Wagstaff, [email protected]

Page 2: CS 461: Machine Learning Lecture 4


Plan for Today

Solution to HW 2
Support Vector Machines
  Review of classification
  Non-separable data: the Kernel Trick
  Regression
Neural Networks
  Perceptrons
  Multilayer Perceptrons
  Backpropagation

Page 3: CS 461: Machine Learning Lecture 4


Review from Lecture 3

Decision trees
  Regression trees, pruning
Evaluation
  One classifier: errors, confidence intervals, significance, baselines
  Comparing two classifiers: McNemar's test
  Cross-validation
Support Vector Machines
  Classification: linear discriminants, maximum margin
  Learning (optimization): gradient descent, QP
  Non-separable classes

Page 4: CS 461: Machine Learning Lecture 4


Support Vector Machines

Chapter 10

Page 5: CS 461: Machine Learning Lecture 4


Support Vector Machines

Maximum-margin linear classifiers

Support vectors

How to find best w, b?

$\min\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y^t\left(\mathbf{w}^T \mathbf{x}^t + b\right) \geq +1,\ \forall t$

$g(\mathbf{x} \mid \mathbf{w}, w_0) = \mathbf{w}^T \mathbf{x} + w_0 = \sum_{j=1}^{d} w_j x_j + w_0$

Optimization in action: SVM applet

Page 6: CS 461: Machine Learning Lecture 4


What if Data isn’t Linearly Separable?

1. Add "slack" variables to permit some errors [Andrew Moore's slides]
2. Embed the data in a higher-dimensional space
   Explicit: basis functions (new features)
   Implicit: kernel functions (new dot product)
   Either way, we still need to find a linear hyperplane
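Both fixes appear directly in standard SVM implementations: the soft-margin parameter C controls how much slack is tolerated, and the kernel argument selects the implicit embedding. A minimal sketch using scikit-learn (the toy dataset and parameter values below are illustrative assumptions, not from the lecture):

```python
# Sketch: soft-margin SVMs with different kernels via scikit-learn.
# The toy data and parameter choices are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # not linearly separable

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0, degree=2, gamma="scale")  # C = slack penalty
    clf.fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```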

Page 7: CS 461: Machine Learning Lecture 4


Build a Better Dot Product: Basis Functions

Preprocess the input x with basis functions: $\mathbf{z} = \varphi(\mathbf{x})$, so $g(\mathbf{z}) = \mathbf{w}^T \mathbf{z}$, i.e., $g(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x})$

$g(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b = \sum_t \alpha^t y^t \varphi(\mathbf{x}^t)^T \varphi(\mathbf{x}) + b = \sum_t \alpha^t y^t K(\mathbf{x}^t, \mathbf{x}) + b$

[Alpaydin 2004 The MIT Press]

Page 8: CS 461: Machine Learning Lecture 4


Build a Better Dot Product: Basis Functions -> Kernel Functions

$g(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b = \sum_t \alpha^t y^t \varphi(\mathbf{x}^t)^T \varphi(\mathbf{x}) + b = \sum_t \alpha^t y^t K(\mathbf{x}^t, \mathbf{x}) + b$

Kernel choices [Alpaydin 2004 The MIT Press]:

Linear: $K(\mathbf{x}^t, \mathbf{x}) = \mathbf{x}^T \mathbf{x}^t$

Polynomial (degree $q$): $K(\mathbf{x}^t, \mathbf{x}) = (\mathbf{x}^T \mathbf{x}^t + 1)^q$; for example, with $d = 2$ and $q = 2$,
$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^T \mathbf{y} + 1)^2 = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$,
which is the ordinary dot product of the basis expansions $\varphi(\mathbf{x}) = [1,\ \sqrt{2} x_1,\ \sqrt{2} x_2,\ \sqrt{2} x_1 x_2,\ x_1^2,\ x_2^2]^T$ and $\varphi(\mathbf{y})$.

RBF: $K(\mathbf{x}^t, \mathbf{x}) = \exp\left[ -\frac{\|\mathbf{x}^t - \mathbf{x}\|^2}{2\sigma^2} \right]$

Sigmoid: $K(\mathbf{x}^t, \mathbf{x}) = \tanh\left(2\, \mathbf{x}^T \mathbf{x}^t + 1\right)$

Optimization in action: SVM applet
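To make the kernel-equals-dot-product identity concrete, one can check numerically that the degree-2 polynomial kernel matches the explicit basis expansion above. A small sketch (not from the slides); the test vectors are arbitrary:

```python
# Sketch: verify that the degree-2 polynomial kernel equals a dot product
# in the expanded feature space phi(x) = [1, √2·x1, √2·x2, √2·x1·x2, x1², x2²].
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def poly_kernel(x, y):
    return (np.dot(x, y) + 1.0) ** 2

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.3])
print(poly_kernel(x, y))          # kernel value
print(np.dot(phi(x), phi(y)))     # same value, computed the explicit way
```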

Page 9: CS 461: Machine Learning Lecture 4


Example: Orbital Classification

Linear SVM flying onboard the EO-1 Earth Orbiter since Dec. 2004
  Classifies every pixel
  Four classes
  12 features (of 256 collected)

Hyperion image and classified map (EO-1) [Castano et al., 2005]
Classes: Snow, Land, Water, Ice

Page 10: CS 461: Machine Learning Lecture 4


SVM Regression

Use a linear model (possibly kernelized): $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$

Use the ε-insensitive error function:

$e_\varepsilon\left(y^t, f(\mathbf{x}^t)\right) = \begin{cases} 0 & \text{if } \left|y^t - f(\mathbf{x}^t)\right| < \varepsilon \\ \left|y^t - f(\mathbf{x}^t)\right| - \varepsilon & \text{otherwise} \end{cases}$

$\min\ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_t \left(\xi_+^t + \xi_-^t\right)$
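A short sketch of the ε-insensitive error, just to make the flat zero-penalty tube explicit (the ε value and sample predictions are made up for illustration):

```python
# Sketch: epsilon-insensitive error — zero penalty inside the epsilon tube,
# linear penalty outside it.
import numpy as np

def eps_insensitive(y_true, y_pred, eps=0.1):
    gap = np.abs(y_true - y_pred)
    return np.where(gap < eps, 0.0, gap - eps)

y_true = np.array([1.00, 1.00, 1.00])
y_pred = np.array([1.05, 1.30, 0.60])
print(eps_insensitive(y_true, y_pred))  # ≈ [0, 0.2, 0.3]
```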

Page 11: CS 461: Machine Learning Lecture 4


Neural Networks

Chapter 11Chapter 11

Page 12: CS 461: Machine Learning Lecture 4


Perceptron

[Alpaydin 2004 The MIT Press]

Graphical: [perceptron diagram: inputs $x_1, \ldots, x_d$ plus bias unit $x_0 = +1$, weights $w_0, w_1, \ldots, w_d$, output $y$]

Math: $y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$, where $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ and $\mathbf{x} = [1, x_1, \ldots, x_d]^T$

Page 13: CS 461: Machine Learning Lecture 4


Perceptrons in action

[Alpaydin 2004 The MIT Press]

Regression: $y = wx + w_0$
Classification: $y = 1(wx + w_0 > 0)$

[Diagrams: single-input perceptrons for regression and classification, with bias unit $x_0 = +1$, weights $w$ and $w_0$, weighted sum $s$, and output $y$]

Page 14: CS 461: Machine Learning Lecture 4


“Smooth” Output: Sigmoid Function

$y = \mathrm{sigmoid}\left(\mathbf{w}^T \mathbf{x}\right) = \frac{1}{1 + \exp\left[-\mathbf{w}^T \mathbf{x}\right]}$

1. Calculate $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$ and choose $C_1$ if $g(\mathbf{x}) > 0$, or
2. Calculate $y = \mathrm{sigmoid}\left(\mathbf{w}^T \mathbf{x}\right)$ and choose $C_1$ if $y > 0.5$

Why?

• Converts output to probability!

• Less “brittle” boundary
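A tiny sketch (not from the slides) checking that the two decision rules agree, with the sigmoid output read as a probability-like score; the weights and input are arbitrary:

```python
# Sketch: thresholding g(x) = w·x at 0 is equivalent to thresholding
# sigmoid(g(x)) at 0.5; the sigmoid additionally gives a probability-like score.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.5, -1.0, 2.0])   # illustrative weights (bias first)
x = np.array([1.0, 0.3, 0.8])    # x_0 = +1 for the bias term
g = np.dot(w, x)
y = sigmoid(g)
print(g > 0, y > 0.5, y)         # same decision; y ≈ P(C1 | x)
```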

Page 15: CS 461: Machine Learning Lecture 4


K Outputs

Regression: $y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x}$, i.e., $\mathbf{y} = \mathbf{W}\mathbf{x}$

Classification (softmax): $o_i = \mathbf{w}_i^T \mathbf{x}$, $\quad y_i = \frac{\exp o_i}{\sum_k \exp o_k}$, $\quad$ choose $C_i$ if $y_i = \max_k y_k$

[Alpaydin 2004 The MIT Press]
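A brief sketch of the K-output softmax layer (weights and input are made-up values; subtracting the max before exponentiating is a standard numerical-stability trick, not something on the slide):

```python
# Sketch: K linear outputs o = W·x followed by softmax, then argmax decision.
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))   # subtract max for numerical stability
    return e / e.sum()

K, d = 3, 4
rng = np.random.default_rng(1)
W = rng.normal(size=(K, d + 1))          # one weight row per class, bias included
x = np.append(1.0, rng.normal(size=d))   # x_0 = +1 for the bias term

o = W @ x
y = softmax(o)
print(y, "choose class", int(np.argmax(y)))
```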

Page 16: CS 461: Machine Learning Lecture 4


Training a Neural Network

1. Randomly initialize weights
2. Update: learning rate × (desired − actual) × input

$\Delta w_j^t = \eta\left(y^t - \hat{y}^t\right) x_j^t$

Page 17: CS 461: Machine Learning Lecture 4


Learning Boolean AND

[Alpaydin 2004 The MIT Press]

$\Delta w_j^t = \eta\left(y^t - \hat{y}^t\right) x_j^t$

Perceptron demo
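A minimal sketch of this update rule learning Boolean AND, in the spirit of the demo; the learning rate, epoch count, and step activation are my assumptions:

```python
# Sketch: perceptron trained on Boolean AND with the update
# delta_w_j = eta * (y - y_hat) * x_j, using a step activation.
import numpy as np

X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])  # x_0 = +1 bias column
y = np.array([0, 0, 0, 1])                                  # AND truth table

w = np.zeros(3)
eta = 0.1
for epoch in range(20):
    for xt, yt in zip(X, y):
        y_hat = 1 if np.dot(w, xt) > 0 else 0
        w += eta * (yt - y_hat) * xt

print(w)
print([1 if np.dot(w, xt) > 0 else 0 for xt in X])  # recovers AND: [0, 0, 0, 1]
```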

Page 18: CS 461: Machine Learning Lecture 4


Multilayer Perceptrons = MLP = ANN

[Alpaydin 2004 The MIT Press]

$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$

$z_h = \mathrm{sigmoid}\left(\mathbf{w}_h^T \mathbf{x}\right) = \frac{1}{1 + \exp\left[ -\left( \sum_{j=1}^{d} w_{hj} x_j + w_{h0} \right) \right]}$
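A compact sketch of that forward pass (layer sizes and random weights are illustrative assumptions):

```python
# Sketch: MLP forward pass — hidden layer z = sigmoid(W·x), output y = V·z.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H, K = 4, 3, 2                        # inputs, hidden units, outputs (assumed sizes)
rng = np.random.default_rng(2)
W = rng.normal(size=(H, d + 1))          # hidden weights w_h (bias in column 0)
V = rng.normal(size=(K, H + 1))          # output weights v_i (bias in column 0)

x = np.append(1.0, rng.normal(size=d))   # prepend x_0 = +1
z = np.append(1.0, sigmoid(W @ x))       # prepend z_0 = +1 for the output bias
y = V @ z
print(z, y)
```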

Page 19: CS 461: Machine Learning Lecture 4


x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)

[Alpaydin 2004 The MIT Press]
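Because XOR decomposes this way, two hidden units (one per AND clause) plus an OR-like output unit suffice. A sketch with hand-picked weights; the specific values and the step activation are my choices, not from the slide (with large enough weights, sigmoid units behave the same way):

```python
# Sketch: a 2-hidden-unit MLP computing XOR with hand-picked weights.
# Hidden unit 1 fires for (x1 AND ~x2), hidden unit 2 for (~x1 AND x2),
# and the output unit ORs them together.
import numpy as np

def step(a):
    return (a > 0).astype(float)

W = np.array([[-0.5,  1.0, -1.0],    # h1: x1 AND ~x2   (bias, w1, w2)
              [-0.5, -1.0,  1.0]])   # h2: ~x1 AND x2
v = np.array([-0.5, 1.0, 1.0])       # output: h1 OR h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([1.0, x1, x2])          # x_0 = +1 bias
    z = np.append(1.0, step(W @ x))      # z_0 = +1 bias
    y = step(np.dot(v, z))
    print((x1, x2), "->", int(y))        # prints the XOR truth table
```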

Page 20: CS 461: Machine Learning Lecture 4


Backpropagation: MLP training

$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$

$z_h = \mathrm{sigmoid}\left(\mathbf{w}_h^T \mathbf{x}\right) = \frac{1}{1 + \exp\left[ -\left( \sum_{j=1}^{d} w_{hj} x_j + w_{h0} \right) \right]}$

Chain rule for the hidden-layer weights:

$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i} \frac{\partial y_i}{\partial z_h} \frac{\partial z_h}{\partial w_{hj}}$

Page 21: CS 461: Machine Learning Lecture 4


Backpropagation: Regression

Forward:
$z_h = \mathrm{sigmoid}\left(\mathbf{w}_h^T \mathbf{x}\right)$
$\hat{y}^t = \sum_{h=1}^{H} v_h z_h^t + v_0$

Error:
$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2} \sum_t \left(y^t - \hat{y}^t\right)^2$

Backward:
$\Delta v_h = \eta \sum_t \left(y^t - \hat{y}^t\right) z_h^t$
$\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E}{\partial \hat{y}^t} \frac{\partial \hat{y}^t}{\partial z_h^t} \frac{\partial z_h^t}{\partial w_{hj}} = -\eta \sum_t -\left(y^t - \hat{y}^t\right) v_h z_h^t \left(1 - z_h^t\right) x_j^t = \eta \sum_t \left(y^t - \hat{y}^t\right) v_h z_h^t \left(1 - z_h^t\right) x_j^t$
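A minimal sketch of these updates as an online (per-example) training loop for a single-output regression MLP; the toy data, layer size, learning rate, and epoch count are assumptions, and the slide's sums over t correspond to accumulating the per-example updates:

```python
# Sketch: online backpropagation for a single-output regression MLP.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(50, 1))              # toy 1-D inputs (assumed data)
targets = np.sin(np.pi * X[:, 0])                 # toy regression targets
Xb = np.hstack([np.ones((len(X), 1)), X])         # prepend x_0 = +1

H, eta = 5, 0.1                                   # assumed hidden size, learning rate
W = rng.normal(scale=0.5, size=(H, Xb.shape[1]))  # hidden-layer weights w_h
v = rng.normal(scale=0.5, size=H + 1)             # output weights (v_0 first)

for epoch in range(2000):
    for x, y in zip(Xb, targets):
        z = np.append(1.0, sigmoid(W @ x))        # forward: hidden layer, z_0 = +1
        y_hat = np.dot(v, z)                      # forward: linear output
        err = y - y_hat
        delta_v = eta * err * z                                         # Δv_h
        delta_W = eta * err * np.outer(v[1:] * z[1:] * (1 - z[1:]), x)  # Δw_hj
        v += delta_v
        W += delta_W

Z = sigmoid(Xb @ W.T)                             # report fit on the training set
preds = np.hstack([np.ones((len(Z), 1)), Z]) @ v
print("training MSE:", np.mean((targets - preds) ** 2))
```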

Page 22: CS 461: Machine Learning Lecture 4


Backpropagation: Classification

Forward:
$z_h = \mathrm{sigmoid}\left(\mathbf{w}_h^T \mathbf{x}\right)$
$\hat{y}^t = \mathrm{sigmoid}\left( \sum_{h=1}^{H} v_h z_h^t + v_0 \right)$

Error (cross-entropy):
$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[ y^t \log \hat{y}^t + \left(1 - y^t\right) \log\left(1 - \hat{y}^t\right) \right]$

Backward (same form as for regression, since the sigmoid output pairs with the cross-entropy error):
$\Delta v_h = \eta \sum_t \left(y^t - \hat{y}^t\right) z_h^t$
$\Delta w_{hj} = \eta \sum_t \left(y^t - \hat{y}^t\right) v_h z_h^t \left(1 - z_h^t\right) x_j^t$
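Relative to the regression sketch above, only the forward output (now a sigmoid) and the error function change; the update expressions keep the same form. A fragment showing one per-example update step (names follow the earlier sketch and are assumptions):

```python
# Sketch: inner-loop update for two-class classification (sigmoid output,
# cross-entropy error). The update expressions match the regression case.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def classification_step(W, v, x, y, eta=0.1):
    z = np.append(1.0, sigmoid(W @ x))        # hidden layer, z_0 = +1
    y_hat = sigmoid(np.dot(v, z))             # P(C1 | x) instead of a raw value
    err = y - y_hat                           # same (y - y_hat) factor as before
    delta_v = eta * err * z
    delta_W = eta * err * np.outer(v[1:] * z[1:] * (1 - z[1:]), x)
    return W + delta_W, v + delta_v
```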

Page 23: CS 461: Machine Learning Lecture 4


Backpropagation Algorithm

Page 24: CS 461: Machine Learning Lecture 4


Examples

Digit Recognition
Ball Balancing

Page 25: CS 461: Machine Learning Lecture 4


ANN vs. SVM

SVM with sigmoid kernel = 2-layer MLP
Parameters
  ANN: # hidden layers, # nodes
  SVM: kernel, kernel params, C
Optimization
  ANN: local minimum (gradient descent)
  SVM: global minimum (QP)
Interpretability? About the same…
So why SVMs? Sparse solution, geometric interpretation, less likely to overfit the data

Page 26: CS 461: Machine Learning Lecture 4


Summary: Key Points for Today

Support Vector Machines
  Non-separable data: basis functions, the Kernel Trick
  Regression
Neural Networks
  Perceptrons
    Sigmoid
    Training by gradient descent
  Multilayer Perceptrons
    Backpropagation (gradient descent)
ANN vs. SVM

Page 27: CS 461: Machine Learning Lecture 4


Review for Midterm

Page 28: CS 461: Machine Learning Lecture 4


Next Time

Midterm Exam!
  9:00 - 10:30 a.m.
  Open book, open notes (no computer)
  Covers all material through today
Bayesian Methods (read Ch. 3.1, 3.2, 3.7, 3.9)
  If you're rusty on probability, read Appendix A
Questions to answer from the reading
  Posted on the website (calendar)