Machine Learning Course


    6.891 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    6.891 Machine Learning

    News


    6.891 Machine Learning

    Review & Overview

    6.891 Machine Learning

    Course Information


    6.891 Machine Learning

    Grading Experiment!!!

    6.891 Machine Learning

    Course Goals


    6.891 Machine Learning

    Course Goals:

    NIPS 1989

    6.891 Machine Learning

    Course Goals:

    Pitts and McCulloch, 1947


    6.891 Machine Learning

    Goals: Analysis and Computation

    6.891 Machine Learning

    What is Machine Learning?


    6.891 Machine Learning

    Physical Laws

Newton's Measurements

[Figure: scatter plot of Newton's measurements, acceleration vs. force.]

6.891 Machine Learning

Physical Laws: Theorize

F = ma

Ignore errors & inconsistencies.

[Figure: Newton's measurements with the theorized line F = ma, acceleration vs. force.]


    6.891 Machine Learning

    Different Types of Learned Relations

F = ma,  pV = nRT

6.891 Machine Learning

Some Notation

$\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$  (feature vector)

$C \in \{C_1, C_2, \ldots\}$  (class labels)

$\mathbf{y} = (y_1, y_2, \ldots)^T$  (outputs)

$x^j$ or $\mathbf{x}^j$ : the j-th example

$C^j$ and $y^j$ : class and output for the j-th example

$t^j$ : target for the j-th example


    6.891 Machine Learning

Additional Notation (Abusive!)

$y^j = y(x^j)$,   $C^j = C(x^j)$

Empirical error:  $E = \sum_j l\left(y(x^j) - t^j\right)$

0-1 loss:  $l(z) = \begin{cases} 0 & \text{if } z = 0 \\ 1 & \text{otherwise} \end{cases}$

Squared loss:  $l(z) = z^2$

    6.891 Machine Learning

    Example: Digit Recognition


    6.891 Machine Learning

    Tremendous Variety

    6.891 Machine Learning

    Hand Labeled Data


    6.891 Machine Learning

    Final Performance

    6.891 Machine Learning

    Differentiating Speech & Music

    The Key Issue: Features!


    6.891 Machine Learning

    Speech Recognition

Sound: "Now is the time..."

    6.891 Machine Learning

    Evaluation of Credit Risk


    6.891 Machine Learning

    Digit Recognition in Detail

Candidate features (for classes 1 and 2): perimeter, width, height, number of black pixels.

    6.891 Machine Learning

    Rules for Classification

[Figure: one of the feature graphs, with classes 1 and 2.]

$C(x) = \theta(ax + b)$,  where  $\theta(y) = \begin{cases} 1 & \text{if } y \ge 0 \\ 0 & \text{otherwise} \end{cases}$


    6.891 Machine Learning

Using Bayes' Law

[Figure: feature histograms for classes "1" and "2".]

$P(C = \text{"2"} \mid F = f) = \dfrac{P(F = f \mid \text{"2"})\, P(\text{"2"})}{P(F = f)}$

In general:  $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$

    6.891 Machine Learning

    Combining Features

    x1

    x2


    6.891 Machine Learning

6.891 Machine Learning and Neural Networks


    6.891 Machine Learning

    News


    6.891 Machine Learning

    Review & Overview


    6.891 Machine Learning

    Fitting a Curve to Data

[Figure: 10 data points sampled on the interval [0, 1].]

Data: $\{x^j, t^j\}$        Find $y(x)$ = ??

You are given 10 example data points. These are samples of a physical relationship, perhaps including noise.

Your challenge is to make predictions for this relationship:

- Interpolation: between the example points.
- Extrapolation: beyond the data.

In principle there are an infinite number of functions that could be associated with this data; our challenge is to pick one.

In the final analysis we may want to hedge our bets and return a probability distribution over functions.


    6.891 Machine Learning

    Polynomial Fitting

Data: $\{x^j, t^j\}$

$y(x^j) = w_0 + w_1 x^j + \cdots + w_M (x^j)^M = \sum_{l=0}^{M} w_l\,(x^j)^l$

Weight vector: $\mathbf{w} = \{w_0, w_1, \ldots, w_M\}$

$y(x^j; \mathbf{w})$

Imagine that we are constraining ourselves to the class of polynomials.

Each M-th order polynomial is parameterized by M+1 parameters.

The learning process becomes a process by which we select values of the weights $w_l$.

The dependency of the $y(\cdot)$ function on $\mathbf{w}$ can be highlighted by the notation $y(x; \mathbf{w})$.


    6.891 Machine Learning

    Graphical Representation

$y(x^j) = \sum_{l=0}^{M} w_l\,(x^j)^l$

[Figure: a single-layer network; inputs $x^0, x^1, x^2, x^3, \ldots, x^9$ with weights $w_0, w_1, w_2, w_3, \ldots, w_9$ feeding the output $y$.]

Scalar representation vs. a high-dimensional, non-linear representation.

- While the algebraic notation for y() is clear and specific, we will see that it is sometimes also useful to develop a graphical notation for both classifiers and regression functions.

- This idea was originally popularized in the neural network literature, wherein neural networks were almost always drawn in their graphical form.

- The graphical notation points out that an intermediate representation for x is formed (M+1 exponentiations). The resulting problem is then one of learning the linear relationship between this high-dimensional space and t.


    6.891 Machine Learning

Empirical loss:  $E(\mathbf{w}) = \sum_j \mathrm{loss}\left(y(x^j; \mathbf{w}) - t^j\right)$

Find optimal weights:  $\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$

Choose the Best Polynomial

- What defines the best polynomial function? Perhaps it is the one which is most consistent with the training data?

- Actually we would rather return the function which makes the best predictions on future data; unfortunately there may be no source for this data.

- For the time being let's assume that we want to find the function which best agrees with the training data: the function with the lowest loss.


    6.891 Machine Learning

    Simple Loss Functions Simplify Learning

$\mathrm{loss}(z) = z^2$

$E(\mathbf{w}) = \sum_j \left(y(x^j; \mathbf{w}) - t^j\right)^2$

$\dfrac{\partial E}{\partial w_i} = 2 \sum_j \left(y(x^j; \mathbf{w}) - t^j\right)(x^j)^i = 0$

- Certain simple loss functions lead to learning algorithms which are easy to derive and inexpensive to compute.

- For example, squared loss can be minimized by differentiating and setting the derivative to zero.

- The result is a set of linear equations that can be solved by inverting a matrix.
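As a minimal Matlab sketch of those linear equations (the variable names x, t, and M are illustrative, not from the slides), build the design matrix of powers of x and solve the normal equations:

    % x: column vector of inputs, t: column vector of targets, M: polynomial order
    X = zeros(length(x), M+1);
    for l = 0:M
        X(:, l+1) = x.^l;        % column l+1 holds x^l
    end
    w = (X' * X) \ (X' * t);     % solve the normal equations (X'X) w = X' t
    pred = X * w;                % fitted values on the training inputs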


    6.891 Machine Learning

    First order fit

[Figure: a first-order (linear) fit to the 10 data points on [0, 1].]

$E(\mathbf{w}) = \sum_j \left(y(x^j; \mathbf{w}) - t^j\right)^2$

The optimal function minimizes the residual error.

- There is a pleasant physical analogy for the squared loss. The function is connected by springs to the data. The system is then allowed to relax until the forces are balanced. The minimum-energy solution is the one that is closest to the training data.


    6.891 Machine Learning

    Fitting Different Polynomials

[Figure: polynomial fits of order 1, 2, 3, 4, 6, and 9 to the same 10 data points.]

Each order of polynomial leads to a different fit.

Higher-order polynomials come closer to the training data.

The 9th-order polynomial can fit the 10 data points perfectly.

Which of these is the most likely to generalize?


    6.891 Machine Learning

    Target Function

$h(x) = 0.5 + 0.4\,\sin(2\pi x)$

$t^j = h(x^j)$

[Figure: the target function h(x) plotted on [0, 1].]

    The function that generated the data was not a polynomial at all.


    6.891 Machine Learning

    Matlab Code

% Construct training data
train_in = [1:10]/10;
train_out = 0.5 + 0.4 * sin(2 * pi * train_in) + 0.1 * randn(size(train_in));

% Fit a polynomial
order = 3
p = polyfit(train_in, train_out, order)

% Construct a test set
test_in = [1:300]/300;
true_out = 0.5 + 0.4 * sin(2 * pi * test_in);

% Compute the polynomial prediction
fit_out = polyval(p, test_in);

% Plot the results
% first: training data, second: test data, third: predictions
plot(train_in, train_out, 'o', test_in, true_out, test_in, fit_out)
axis([0 1 -0.1 1.1])

Above is all the code used to generate the previous graphs.

As we can see, Matlab allows us to explore issues in machine learning without much hacking.


    6.891 Machine Learning

    First General Problem in Learning

There are a few general (grand challenge) problems in machine learning. Perhaps the most important is the problem of controlling complexity. We have seen that a simpler approximator can fit an unknown function better than a more complex approximator. Building a theory for this is one grand challenge of learning.

Clearly this problem has been appreciated for a long time.

Vapnik's statement is perhaps the most confusing. When he says "theory" what he means is something like a learning algorithm. The learning algorithm which fits 9th-order polynomials to 10 data points is not falsifiable.

- No set of 10 data points would fail to be fit perfectly by this algorithm.


    6.891 Machine Learning

    Overfitting in Classification

This is not to say that such problems are unique to regression. Determining decision boundaries for classification is very similar.

We need to balance the complexity of the boundary against the accuracy on the training data.


    6.891 Machine Learning

    Recall the probabilistic approach

    P(F|C1) P(F|C2)


    6.891 Machine Learning

    Probabilistic Approach

    Thomas Bayes1702-1761

$P(C = C_k \mid X = x) = \dfrac{P(X = x \mid C = C_k)\, P(C = C_k)}{P(X = x)}$

$P(C_k \mid x) = \dfrac{P(x \mid C_k)\, P(C_k)}{P(x)}$

Joint:  $P(C = C_k,\, X = x)$

Class-conditionals:  $P(X \mid C_1)$ and $P(X \mid C_2)$


    6.891 Machine Learning

    Probability Densities

$P(X \in [a, b]) = \int_a^b p(X = x)\, dx$

$p(x) \equiv p_X(x) = p(X = x)$

$P(C = C_k \mid X = x) = \dfrac{p(X = x \mid C = C_k)\, P(C = C_k)}{p(X = x)}$

$P(C_k \mid x) = \dfrac{p(x \mid C_k)\, P(C_k)}{p(x)}$

$p(X = x)\, dx = P(X \in [x, x + dx])$

Probability densities are necessary because, for a continuous random variable, the probability of any single value is zero.

The density measures the slope of the cumulative distribution function.

Alternatively, it is the probability per unit area (or length, or volume) measured over an infinitesimal region.

Somewhat surprisingly, the density is used in the same way that the distribution function is used. In other words, the probability distribution of the class given the feature value can be found using the densities of the features.


    6.891 Machine Learning

    Bayes Law for Densities

$C(x) = \begin{cases} 1 & \text{if } P(C_1 \mid x) > P(C_2 \mid x) \\ 2 & \text{otherwise} \end{cases}$

Duda & Hart, 1973

[Figure: class-conditional densities for Class 1 and Class 2.]

Just as before we can graph the conditional probability of class given feature. The functions are now continuous.

Given the Bayes classification rule, a set of decision regions is defined.
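As a minimal sketch of this rule in Matlab (the means, standard deviations, and priors below are made-up illustrative numbers, not values from the slides), compute the posterior under two Gaussian class-conditionals and pick the larger:

    % Illustrative class-conditional Gaussians and priors (made-up numbers)
    mu = [0.16, 2.2];  sigma = [0.8, 1.0];  prior = [0.5, 0.5];
    x = 1.0;                                                        % a feature value to classify
    lik = exp(-(x - mu).^2 ./ (2*sigma.^2)) ./ (sqrt(2*pi)*sigma);  % p(x | C_k)
    post = lik .* prior / sum(lik .* prior);                        % P(C_k | x) by Bayes' law
    [maxpost, C] = max(post);                                       % decide the class with the larger posterior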


    6.891 Machine Learning

    Decisions

    These analyses are easily generalized to:

    - Multiple classes

    - Multiple dimensions


    6.891 Machine Learning

    Probabilistic Classification Review

    P(F|C) & P(C) -> P(F,C)


    6.891 Machine Learning

    Information Retrieval


    6.891 Machine Learning

    Keyword Search Works Well


    6.891 Machine Learning

Naïve Bayes Classifier

$P(F_i \mid C_j)$ : probability of word $i$ appearing in a doc from class $j$

$f_i(\mathrm{Doc}_j) = 1$ if Doc $j$ contains word $i$ (0 otherwise)

$P(\{f_i\} \mid C_j) = \prod_i P(F_i = f_i \mid C_j)$

$P(C_j \mid \{f_i\}) = \dfrac{P(C_j)\, \prod_i P(F_i = f_i \mid C_j)}{P(\{f_i\})}$
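To make the product concrete, here is a minimal Matlab sketch that scores one document against each class using log-probabilities (wordprob, prior, and f are hypothetical variables, not defined in the slides; wordprob entries must lie strictly between 0 and 1):

    % wordprob(i,j): estimate of P(F_i = 1 | C_j);  prior(j): P(C_j)
    % f: 0/1 column vector, f(i) = 1 if the document contains word i
    logpost = log(prior);
    for j = 1:length(prior)
        logpost(j) = logpost(j) + sum( f .* log(wordprob(:,j)) ...
                                     + (1 - f) .* log(1 - wordprob(:,j)) );
    end
    [maxval, c] = max(logpost);     % classify into the highest-scoring class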


    6.891 Machine Learning

$P(F_i) = \dfrac{\#\{\text{docs containing word } i\}}{\#\{\text{docs}\}}$

$P(F_i \mid C_j) = \dfrac{\#\{\text{training docs in class } j \text{ with word } i\}}{\#\{\text{training docs in class } j\}}$

    Potential Bug:

    None of our Training Docs contain Mercedes

    Estimating Probabilities
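One common fix for this zero-count bug is add-one (Laplace) smoothing; the slides develop the full Bayesian version later, so the sketch below is only an illustrative assumption, with counts and ndocs as hypothetical variables:

    % counts(i,j): number of training docs of class j containing word i
    % ndocs(j):   number of training docs of class j
    wordprob = zeros(size(counts));
    for j = 1:length(ndocs)
        % add-one smoothing: pretend each word was seen once more in each class
        wordprob(:,j) = (counts(:,j) + 1) / (ndocs(j) + 2);
    end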


    6.891 Machine Learning

    Density Estimation is Ambiguous


    6.891 Machine Learning

    Impacts Classification


    Machine Learning

    Review & Overview

    Machine Learning

    Keyword Search Works Well


    Machine Learning

    Bayesian Text Classification

$P(F_i \mid C_j)$ : probability of word $i$ appearing in a doc from class $j$

Assume independence:

$P(F_1 = f_1, F_2 = f_2, \ldots, F_N = f_N \mid C = c_j) = P(f_1, f_2, \ldots \mid C = c_j) = \prod_i P(F_i = f_i \mid C = c_j)$

$P(F_i = 1 \mid C = c_j) = p_{ij}$

$d$ : a collection of documents

$W_i(d_k) = \begin{cases} 1 & \text{if document } d_k \text{ contains word } i \\ 0 & \text{otherwise} \end{cases}$

    Machine Learning

    Bayes Nets Show Dependencies

$P(\{f_i\} \mid C_j) = \prod_i P(F_i = f_i \mid C_j)$

[Figure: a Bayes net with class node C pointing to feature nodes F1, F2, F3, F4, ..., FN.]

$P(F_1 \mid C)$

Bayes nets show the dependencies between random variables.


    Machine Learning

    Classification Using Bayes Law

$P(c_j \mid \{f_i\}) = \dfrac{P(c_j)\, \prod_i P(f_i \mid c_j)}{P(\{f_i\})}$

$c = 1$ : German Cars        $c = 0$ : Other Documents

    Machine Learning

    Estimating Probability Distributions

$P(F_i = 1 \mid C = c_j) = p_{ij}$

$d$ : a collection of documents

$W_i(d_k) = \begin{cases} 1 & \text{if document } d_k \text{ contains word } i \\ 0 & \text{otherwise} \end{cases}$

How do we estimate $p_{ij}$?


    Machine Learning

    Maximum Likelihood

$P(d_k \mid c_0) = P(\{f_i^k\} \mid c_0) = \prod_i P(F_i = f_i^k \mid c_0) = \prod_i p_{i0}^{\,f_i^k}\,(1 - p_{i0})^{1 - f_i^k}$

$\prod_k P(d_k \mid c_0) = \prod_i p_{i0}^{\,n_i}\,(1 - p_{i0})^{N - n_i}$

where $n_i$ is the number of class-0 training docs containing word $i$ and $N$ is the number of class-0 training docs.

    Machine Learning

    Log Likelihood

$L = \log \prod_i p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i} = \sum_i \left[\, n_i \log p_{ij} + (N - n_i)\log(1 - p_{ij}) \,\right]$

$\dfrac{\partial L}{\partial p_{ij}} = \dfrac{n_i}{p_{ij}} - \dfrac{N - n_i}{1 - p_{ij}} = 0$


    Machine Learning

    Maximum Likelihood

$\dfrac{n_i}{p_{ij}} = \dfrac{N - n_i}{1 - p_{ij}}
\;\Longrightarrow\;
n_i\,(1 - p_{ij}) = (N - n_i)\, p_{ij}
\;\Longrightarrow\;
n_i = N\, p_{ij}$

$p_{ij} = \dfrac{n_i}{N}$

    Machine Learning

$P(F_i = 1) = \dfrac{\#\{\text{docs containing word } i\}}{\#\{\text{docs}\}}$

$P(F_i = 1 \mid C_j) = \dfrac{\#\{\text{training docs in class } j \text{ with word } i\}}{\#\{\text{training docs in class } j\}}$

    Potential Bug:

    None of our Training Docs contain Mercedes

    Estimating Probabilities


    Machine Learning

    Prior Expectations

    p(mercedes | GermanCars)?

    Machine Learning

    Bayesian Parameter Estimation

Maximum A Posteriori:

$P(p_{ij} \mid \{d_k\}, c_0) = \dfrac{P(\{d_k\} \mid p_{ij}, c_0)\; p(p_{ij})}{P(\{d_k\} \mid c_0)}$

Maximum Likelihood:

$\max_{p_{ij}}\; P(\{d_k\} \mid p_{ij}, c_0)$

This turns out to be more useful for continuous parameters.


    Machine Learning

    What is the right prior?

$\max_{p_{ij}} P(p_{ij} \mid \{d_k\}, c_0) = \max_{p_{ij}} P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij})$

With a flat prior $p(p_{ij})$ this reduces to Maximum Likelihood:  $\max_{p_{ij}} P(\{d_k\} \mid p_{ij}, c_0)$

$P(\{d_k\} \mid p_{ij}, c_0) = p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i}$

Machine Learning

Probability of the parameters

$P(p_{ij} \mid \{d_k\}, c_0) \propto P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij}) = p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i}\, p(p_{ij})$


    Machine Learning

    Bayesian Estimation

$P(p_{ij} \mid \{d_k\}, c_0) = \dfrac{P(\{d_k\} \mid p_{ij}, c_0)\; p(p_{ij})}{P(\{d_k\} \mid c_0)}
\qquad
P(F_i = 1 \mid c_j, p_{ij}) = p_{ij}$

$P(F_i = 1 \mid c_j) = \int P(F_i = 1 \mid p_{ij}, c_j)\; p(p_{ij} \mid \{d_k\}, c_j)\, dp_{ij}
= \int p_{ij}\; \dfrac{P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij})}{P(\{d_k\} \mid c_0)}\, dp_{ij}$

    Machine Learning

    Continued

$P(F_i = 1 \mid c_j)
= \dfrac{\int p_{ij}\, P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij})\, dp_{ij}}
        {\int P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij})\, dp_{ij}}
= \dfrac{\int p_{ij}\; p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i}\, dp_{ij}}
        {\int p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i}\, dp_{ij}}$

(the last step assumes a flat prior $p(p_{ij})$)
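For completeness (this closing step is not shown on the slide), the two Beta integrals evaluate in closed form and give the familiar add-one estimate:

$\int_0^1 p^{a}(1-p)^{b}\,dp = \dfrac{a!\,b!}{(a+b+1)!}
\quad\Longrightarrow\quad
P(F_i = 1 \mid c_j) = \dfrac{(n_i + 1)!\,(N - n_i)!\,/\,(N + 2)!}{n_i!\,(N - n_i)!\,/\,(N + 1)!} = \dfrac{n_i + 1}{N + 2}$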


    Machine Learning

6.891 Machine Learning and Neural Networks

    Machine Learning

    News


    Machine Learning

    Review & Overview

    Machine Learning

    Why Gaussians ?


    Machine Learning

    Recall: Bayes Decision Boundaries

    Machine Learning

Discriminant Function


    Machine Learning

    Set Discriminants Equal

    Machine Learning


    Machine Learning

    Bayesian Parameter Estimation

    Machine Learning

    Convergence of Probability


    Machine Learning

    Reminder: Why we are here

[Figure: two samples of 20 points each plotted on the line from -2 to 5; one sample spans roughly 0.18 to 4.08, the other roughly -1.27 to 1.40.]

    Machine Learning

    Max Likelihood Gaussian

    Mean: 0.16

    StDev: 0.8

    Mean: 2.2

    StDev: 1.0


    Machine Learning

    Different Samples, Different Decisions

Concept: Variance

The variation you observe when training on different independent training sets.

    Machine Learning

    Variance depends on data set size...

    20 points 2000 points


    Machine Learning

    But when data gets more complex...

    Machine Learning

Gaussians don't work well

Concept: Training Error

The error of your classifier on the training set.


    Paul Viola 1999 Machine Learning 3

    Review & Overview

    Paul Viola 1999 Machine Learning 4

Histogram

[Figure: the 20 sample points binned into a histogram on [-1.5, 1.5]; bin counts on the vertical axis.]

Divide by N to yield a probability.


    Paul Viola 1999 Machine Learning 5

    Simple Algorithm

function counts = myhist(data, centers)
% Initialize counts
counts = zeros(size(centers));
numdata = size(data, 1);
% For each datapoint compute distance to every center
for i = 1:numdata
    diffs = data(i,1) - centers;
    dists = diffs.^2;
    [minval, mindist] = min(dists);          % nearest center
    counts(mindist) = counts(mindist) + 1;   % increment its bin
end
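A minimal usage sketch (the sample data and bin centers below are made-up):

    data = randn(20, 1);                 % 20 one-dimensional sample points
    centers = -1.5:0.25:1.5;             % bin centers
    counts = myhist(data, centers);
    bar(centers, counts / length(data))  % divide by N to yield a probability, as on the slide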

    Paul Viola 1999 Machine Learning 6

    Histogram


    Paul Viola 1999 Machine Learning 7

    Max Likelihood Gaussian

    Mean: 0.16

    StDev: 0.8

    Mean: 2.2

    StDev: 1.0

    Paul Viola 1999 Machine Learning 8

    Histogram Flexibility is Adjustable

[Figure: six histograms of the same sample with different numbers of bins; histogram flexibility is adjustable.]


    Paul Viola 1999 Machine Learning 9

    Histograms have lower bias

    Paul Viola 1999 Machine Learning 10

    but higher variance.


    Paul Viola 1999 Machine Learning 11

    Parzen: One Bump per Data Point

[Figure: a Parzen density estimate on [-1.5, 1.5]; one Gaussian bump per data point, summed.]

    Paul Viola 1999 Machine Learning 12

    Parzen Algorithm

function [func, range] = parzen(data, sigma)
% splitrange and gauss are course helper functions (not Matlab built-ins):
% splitrange(a, b, n) presumably returns n evenly spaced points between a and b,
% and gauss(x, mu, sigma) evaluates the Gaussian density at x.
range = splitrange(min(data), max(data), 500);
numdata = size(data, 1);
% For each point on range compute distance to every datapoint.
for i = 1:size(range, 2)
    gaussvals = gauss(range(i) - data, 0, sigma);
    func(i) = sum(gaussvals)/numdata;
end
plot(range, func)
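For a self-contained version that runs without the course helpers, here is a sketch with made-up data and an explicit Gaussian kernel (sigma is the kernel width):

    data = randn(20, 1);  sigma = 0.25;
    range = linspace(min(data) - 3*sigma, max(data) + 3*sigma, 500);
    func = zeros(size(range));
    for i = 1:length(range)
        d = range(i) - data;                                              % distances to every datapoint
        func(i) = sum(exp(-d.^2/(2*sigma^2)) / (sqrt(2*pi)*sigma)) / length(data);
    end
    plot(range, func)     % the estimate integrates to (approximately) 1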


    Paul Viola 1999 Machine Learning 13

    Parzen and Histogram are Similar

    Paul Viola 1999 Machine Learning 14

    All Three at Once

[Figure: histogram, Parzen estimate, and maximum-likelihood Gaussian overlaid on the same sample.]


    Paul Viola 1999 Machine Learning 15

    Properties of Non-parametric Techniques

    Paul Viola 1999 Machine Learning 16

    Semi-Parametric Models


    Paul Viola 1999 Machine Learning 17

[Figure: a mixture model; flip a coin, then sample $x^j$ from Gaussian($\mu_1, \sigma_1$) or Gaussian($\mu_2, \sigma_2$).]

$p(x \mid \mu_1, \sigma_1)$,   $p(x \mid \mu_2, \sigma_2)$

$P(k = 1) + P(k = 2) = 1$

$p(x) = ???$

    Paul Viola 1999 Machine Learning 18

    Events are Disjoint -> They add

$p(X = x) = p(X = x, J = 1) + p(X = x, J = 2)$
$\qquad = p(X = x \mid J = 1)\,P(J = 1) + p(X = x \mid J = 2)\,P(J = 2)$
$\qquad = p(x \mid \mu_1, \sigma_1)\,P(J = 1) + p(x \mid \mu_2, \sigma_2)\,P(J = 2)$


    Paul Viola 1999 Machine Learning 19

    Face Detection

    Paul Viola 1999 Machine Learning 20

    Generating Training Data

    Sung &

    Poggio


    Paul Viola 1999 Machine Learning 21

    Results

    But, it takes minutes per image

    Paul Viola 1999 Machine Learning 22

    Face Detection


    Paul Viola 1999 Machine Learning 23

    Events are Disjoint -> They add

$p(X = x) = p(X = x, J = 1) + p(X = x, J = 2)$
$\qquad = p(X = x \mid J = 1)\,P(J = 1) + p(X = x \mid J = 2)\,P(J = 2)$
$\qquad = p(x \mid \mu_1, \sigma_1)\,P(J = 1) + p(x \mid \mu_2, \sigma_2)\,P(J = 2)$

    Paul Viola 1999 Machine Learning 24

    Expectation Maximization

Update rules (using the responsibilities $P(k \mid x^j)$ from the current model):

$\mu_k = \dfrac{\sum_j P(k \mid x^j)\, x^j}{\sum_j P(k \mid x^j)}$

$\sigma_k^2 = \dfrac{\sum_j P(k \mid x^j)\, (x^j - \mu_k)^2}{\sum_j P(k \mid x^j)}$

$P(k) = \dfrac{1}{N}\sum_j P(k \mid x^j)$

$E = \log l(\{\mu_k, \sigma_k, q_k\})$

Bounded Below?

Decreases?
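A minimal Matlab sketch of these updates for two 1-D Gaussians (the data and initial guesses are made-up; this is not the course's code):

    % x: column of 1-D data; two-component Gaussian mixture
    x  = [randn(20,1); 2.2 + randn(20,1)];
    mu = [0, 2];  sg = [1, 1];  pk = [0.5, 0.5];
    lik = zeros(length(x), 2);
    for iter = 1:50
        % E-step: responsibilities P(k | x^j) under the current parameters
        for k = 1:2
            lik(:,k) = pk(k) * exp(-(x - mu(k)).^2 / (2*sg(k)^2)) / (sqrt(2*pi)*sg(k));
        end
        r = lik ./ repmat(sum(lik, 2), 1, 2);
        % M-step: the update equations from the slide
        nk = sum(r, 1);
        for k = 1:2
            mu(k) = sum(r(:,k) .* x) / nk(k);
            sg(k) = sqrt(sum(r(:,k) .* (x - mu(k)).^2) / nk(k));
        end
        pk = nk / length(x);
    end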


    Paul Viola 1999 Machine Learning

    News

    Paul Viola 1999 Machine Learning

    Distribution of Grades: Pset 1


    Paul Viola 1999 Machine Learning

    Review & Overview

    Paul Viola 1999 Machine Learning

    Where are we?


    Paul Viola 1999 Machine Learning

    Between density and classification

    Paul Viola 1999 Machine Learning

    Gaussians vs. Discriminants

Two Gaussian classes: $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$

$y(x) = \mathbf{w}^T x + w_0$

Two-class Gaussian, same covariance.


    Paul Viola 1999 Machine Learning

    Linear Discriminant

$y(x) = \mathbf{w}^T x + w_0$

[Figure: a linear unit; inputs $x_0, x_1, x_2, \ldots, x_d$ with weights $w_0, w_1, w_2, \ldots, w_d$ feeding the output $y$.]

$y(x) = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T x$

Bias warning: here $x_0 \equiv 1$, so $w_0$ plays the role of the bias term.

    Paul Viola 1999 Machine Learning

    Multiple Discriminants

$y_1(x) = \mathbf{w}_1^T x + w_{10}
\qquad
y_2(x) = \mathbf{w}_2^T x + w_{20}$

The boundary between the two classes is where the discriminants are equal:

$y_1(x) = y_2(x)
\;\Longrightarrow\; \mathbf{w}_1^T x + w_{10} = \mathbf{w}_2^T x + w_{20}
\;\Longrightarrow\; (\mathbf{w}_1 - \mathbf{w}_2)^T x + (w_{10} - w_{20}) = 0
\;\Longrightarrow\; \mathbf{w}^T x + w_0 = 0$


    Paul Viola 1999 Machine Learning

    in a single network

$y_k(x) = \sum_i w_{ki}\, x_i + w_{k0}$

$w_{ki}$ : weight matrix

$C(x) = C_i \text{ if } i = \arg\max_k\, y_k(x)$

    Paul Viola 1999 Machine Learning

    Multiple Discriminants

    Intersection

    of Half Planes


    Paul Viola 1999 Machine Learning

    Quadratic cost is very simple

$E(W) = \sum_j \left(y(x^j) - t^j\right)^2 = \sum_j \left(W^T x^j - t^j\right)^2$

In matrix form:

$E(W) = (XW - T)^T (XW - T) = W^T X^T X W - 2\, W^T X^T T + T^T T$

$\dfrac{dE(W)}{dW} = 2\, X^T X W - 2\, X^T T = 0$

$W = (X^T X)^{-1} X^T T$
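A minimal Matlab sketch of this closed-form solution (X and T are hypothetical names: one example per row of X, one target per row of T):

    % X: N-by-(d+1) design matrix (first column all ones for the bias term)
    % T: N-by-1 vector of targets
    W = (X' * X) \ (X' * T);    % solves the normal equations above
    pred = X * W;               % predictions on the training examples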

    Paul Viola 1999 Machine Learning

    Is this a model for the brain?

    Pitts and McCulloch, 1947


    Paul Viola 1999 Machine Learning

Can't Always Solve for the Weights

$y(x) = g(\mathbf{w}^T x)$

$g(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ 0 & \text{otherwise} \end{cases}$
  or, with targets in $\{-1, +1\}$:
$g(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ -1 & \text{otherwise} \end{cases}$

    Paul Viola 1999 Machine Learning

    Perceptron


    Paul Viola 1999 Machine Learning

    Perceptron Cost Function

Squared error with a hard threshold:

$E(\mathbf{w}) = \sum_j \left(g(\mathbf{w}^T x^j) - t^j\right)^2$

$\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = 2 \sum_j \left(g(\mathbf{w}^T x^j) - t^j\right) \dfrac{\partial g(\mathbf{w}^T x^j)}{\partial \mathbf{w}}$

Simple gradient descent does not work: $g(\cdot)$ is a step function, so its derivative is zero (or undefined) everywhere.

Perceptron Criterion:

$E(\mathbf{w}) = \sum_j \left(g(\mathbf{w}^T x^j) - t^j\right)(\mathbf{w}^T x^j) \;\propto\; -\!\!\sum_{j \in \text{errors}} t^j\,(\mathbf{w}^T x^j)$

$\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} \;\propto\; -\!\!\sum_{j \in \text{errors}} t^j x^j$

Update rule:  $\mathbf{w}_t = \mathbf{w}_{t-1} + \eta\, t^j x^j$  (for misclassified examples)

    Paul Viola 1999 Machine Learning

    Different Error Measures


    Paul Viola 1999 Machine Learning

    Perceptron Learning

[Figure: a perceptron; inputs $x_0, x_1, x_2, \ldots, x_d$ with weights $w_0, w_1, w_2, \ldots, w_d$ feeding the output $y$.]

    Paul Viola 1999 Machine Learning

    Real Perceptrons


    Paul Viola 1999 Machine Learning

    Paul Viola 1999 Machine Learning

    A classic problem...

[Figure: a scatter of x's and o's that no single line can separate.]


    Paul Viola 1999 Machine Learning

6.891 Machine Learning and Neural Networks

Lecture 7:

Multi-Layer Perceptrons

Back Propagation

Paul Viola 1999 Machine Learning

News

- Pset 3 is on the web

  Includes a classifier shootout

  The mystery dataset has 20 dimensions and two classes

  Winner gets $10 of Toscanini's

- Pset 2 looks great. Many of you did a lot of work.


    Paul Viola 1999 Machine Learning

    Review & Overview

- Lecture 6: Linear Discriminants

  Perceptrons

  Training Perceptrons

- Generalized Perceptrons

- Multi-layer Perceptrons: Multi-Layer Derivatives

  Back Propagation

- Examples: NETtalk

    Paul Viola 1999 Machine Learning

    On-line learning of Perceptrons

$\mathbf{w}_t = \mathbf{w}_{t-1} - \eta\, \dfrac{\partial E^j}{\partial \mathbf{w}} = \mathbf{w}_{t-1} + \eta\, t^j x^j$

[Figure: a perceptron; inputs $x_0, x_1, x_2, \ldots, x_d$, weights $w_0, w_1, w_2, \ldots, w_d$, output $y$.]

- Pick a random example
- Observe the output / error
- Adjust the weights to reduce the error

1: Error function (criterion):  $E(\mathbf{w}) = \sum_j E^j(\mathbf{w})$

2: Update rule (a Matlab sketch of this loop follows below)
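A minimal Matlab sketch of this online loop (X, t, and eta are hypothetical: rows of X are examples with a leading 1 for the bias, t contains labels in {-1, +1}):

    % X: N-by-(d+1) examples (first column = 1), t: N-by-1 labels in {-1, +1}
    eta = 0.1;  w = zeros(size(X,2), 1);
    for pass = 1:100
        j = ceil(rand * size(X,1));            % pick a random example
        if sign(X(j,:) * w) ~= t(j)            % observe the error
            w = w + eta * t(j) * X(j,:)';      % adjust weights to reduce it
        end
    end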


    Paul Viola 1999 Machine Learning

    Different Criteria

    ( )

    =

    =

    =

    errors

    jj

    errors

    jT

    j

    jTjjT

    xtw

    wE

    txw

    txwtxwgE

    2)(

    )(

    )()()(2

    w( )2)( =j

    jjTtxE ww

    ( )

    =

    =

    j

    jj

    j

    jjjT

    x

    xtxw

    E

    2

    2)(

    ww

    jj

    tt xww = 1

    jj

    tt xtww = 1

    Linear Discriminant Perceptron

    Paul Viola 1999 Machine Learning

    Normalizing examples

    j

    tt xww = 1For errors

    only!

  • 8/9/2019 Machine Learning Course

    95/326

    4

    Paul Viola 1999 Machine Learning

    The update rule in action...

    j

    tt xww = 1

    Paul Viola 1999 Machine Learning

    Real Perceptrons

  • 8/9/2019 Machine Learning Course

    96/326

    5

    Paul Viola 1999 Machine Learning

    A classic problem...

[Figure: the same x / o scatter; not linearly separable.]

    Paul Viola 1999 Machine Learning

    Generalized Perceptron

$\mathrm{XOR}(x_1, x_2) = x_1 + x_2 - 2\,x_1 x_2, \qquad x_i \in \{0, 1\}$

$y(x) = g(\mathbf{w}^T x)$ ... can't do that (with $x = (x_1, x_2)$)!

$\phi(x) = \begin{pmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{pmatrix}
\qquad
\mathbf{w} = \begin{pmatrix} 1 \\ 1 \\ -2.1 \end{pmatrix}$
  ... works great!


    Paul Viola 1999 Machine Learning

    Multiple Layers

    =W

    =

    5.2110

    0115.1

    0000

    0000

    0000

    W

    How can we learn this??X1 X2

    y

    w54

    w52 w53

    1

    w41 w42w43

    u1 u2 u3

    u4

    Paul Viola 1999 Machine Learning

1986: The Rebirth of Neural Networks

- The PDP Group had a huge impact


    Paul Viola 1999 Machine Learning

    1980s: Perhaps Gradient Descent?

$y(x) = s(\mathbf{w}^T x), \qquad s(a) = \dfrac{1}{1 + e^{-a}}$

$E(\mathbf{w}) = \sum_j \left(s(\mathbf{w}^T x^j) - t^j\right)^2$

$\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = 2 \sum_j \left(s(\mathbf{w}^T x^j) - t^j\right) \dfrac{\partial s(\mathbf{w}^T x^j)}{\partial \mathbf{w}}$

$\dfrac{\partial s(u)}{\partial \mathbf{w}} = s(u)\left(1 - s(u)\right) \dfrac{\partial u}{\partial \mathbf{w}}$

$\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = 2 \sum_j \left(s(\mathbf{w}^T x^j) - t^j\right) s(\mathbf{w}^T x^j)\left(1 - s(\mathbf{w}^T x^j)\right) x^j$

    Paul Viola 1999 Machine Learning

    Sigmoid Multi-Layer Network

$y(x) = s(w_{64}\, u_4 + w_{65}\, u_5)$

$u_4 = s(w_{41} u_1 + w_{42} u_2 + w_{43} u_3), \qquad u_5 = s(w_{51} u_1 + w_{52} u_2 + w_{53} u_3)$

$y(x) = s\left(w_{64}\, s(w_{41} u_1 + w_{42} u_2 + w_{43} u_3) + w_{65}\, s(w_{51} u_1 + w_{52} u_2 + w_{53} u_3)\right)$

$E(\mathbf{w}) = \sum_j \left(y(x^j) - t^j\right)^2$, and the gradient follows by the chain rule as on the previous slide.

[Figure: a two-layer sigmoid network; inputs $x_1, x_2$ and a constant 1 feed hidden units $u_4, u_5$ through weights $w_{41}, w_{42}, w_{43}, w_{51}, w_{52}, w_{53}$, and the output $u_6 = y$ is computed through $w_{64}, w_{65}$.]


    Paul Viola 1999 Machine Learning

    Multi-Layer Conventions

$u_k = g(a_k), \qquad a_k = \sum_j w_{kj}\, u_j$

- Networks must not have loops.

- Units are ordered: $i > k$ implies $u_i$ is not an input to $u_k$.

- Compute the units in order.

[Figure: the same two-layer network, with units $u_1, \ldots, u_6$ ordered from inputs to output.]

    Paul Viola 1999 Machine Learning

    More Conventions

- If units are organized in layers, the layers can be computed in parallel (a small sketch follows below):

$\mathbf{u}^2 = g(W^{12} \mathbf{u}^1), \qquad \mathbf{u}^3 = g(W^{23} \mathbf{u}^2)$
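A minimal Matlab sketch of this layer-wise forward pass (W12, W23, x1, x2 are hypothetical names; prepending a constant 1 for the bias is an assumption about the convention):

    s = @(a) 1 ./ (1 + exp(-a));    % the sigmoid from the earlier slide
    u1 = [1; x1; x2];               % input layer, with a constant 1 for the bias
    u2 = s(W12 * u1);               % hidden layer: one matrix multiply, then the nonlinearity
    u3 = s(W23 * [1; u2]);          % output layer (again prepending the bias unit)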


    Paul Viola 1999 Machine Learning

    Paul Viola 1999 Machine Learning

  • 8/9/2019 Machine Learning Course

    103/326

    12

    Paul Viola 1999 Machine Learning

    Solving XOR (big deal?)

    Paul Viola 1999 Machine Learning

    Vision Applications (sort of)

    Ts vs. Cs

  • 8/9/2019 Machine Learning Course

    104/326

    13

    Paul Viola 1999 Machine Learning

    Very Simple Solution

    Paul Viola 1999 Machine Learning

NETtalk (1986): First Real Application

- Task: Pronounce English text (Text -> Phonemes)

- Example: "This is it."  /s/ vs. /z/

- 29 possible characters

- 26 phonemes

- 7-character window

- Structure: 203 inputs, 80 hidden, 26 outputs

- 95% accurate


    Paul Viola 1999 Machine Learning

6.891 Machine Learning and Neural Networks

Lecture 8:

Back Prop and Beyond

Paul Viola 1999 Machine Learning

News

- Mid-term will be on 10/20

  Here in this room.

  It should take about 1 hour but we will give you 1.5. Show up on time, please.

  Coverage: Psets 1, 2 and 3.

  Density estimation (parametric, semi- and non-parametric)

  Bayesian classification

  Discriminants (linear, perceptron, multi-layer)


    Paul Viola 1999 Machine Learning

    Review & Overview

- Lecture 7: Multi-Layer Derivatives

  Back Propagation

  Examples: NETtalk

- Why 6.891 is not over! Bugs with gradient descent

  Local minima

  Bias and variance: how many units?

  Variants

    Paul Viola 1999 Machine Learning

    Face Detection Network:

    General Layout

    Baluja, Rowley, and Kanade

  • 8/9/2019 Machine Learning Course

    107/326

    3

    Paul Viola 1999 Machine Learning

    Intensity Preprocessing

    Paul Viola 1999 Machine Learning

    Training Data

    Positives

    Negativ

    es

  • 8/9/2019 Machine Learning Course

    108/326

    4

    Paul Viola 1999 Machine Learning

    Performance

    Paul Viola 1999 Machine Learning

  • 8/9/2019 Machine Learning Course

    109/326

    5

    Paul Viola 1999 Machine Learning

    MLP: How Powerful?

    Paul Viola 1999 Machine Learning

    Derivatives are Cheapl Modeling Factories/Plants

    PlantInput

    Control

    Materials

    Outputs

    Products

    MLP

    Derivatives

    Train with Back Prop

    Use Derivs to Modify input to improve output

  • 8/9/2019 Machine Learning Course

    110/326

    6

    Paul Viola 1999 Machine Learning

    1990: The height of MLPs and Back Prop

    l Multi-layer perceptrons can solve anyapproximation problem (in principle) Given 3 layers

    Given and infinite number of units and weights

    l There is no direct technique for finding theweights (unlike linear discriminants)

    l Gradient descent (using Back Prop) comes todominate discussion in the Neural Net community Can you find a good set of weights quickly?

    How can you speed things up? Will you get stuck in local minima?

    l A small group in the community also worries aboutgeneralization.

    Paul Viola 1999 Machine Learning

How long till we find the min?

- Simplest case: 1 weight, quadratic error function.

[Figure: a single weight w connecting input x to output y.]

$E(w) = (wx - y)^2 = (w - 0)^2$   (for $x = 1$, $y = 0$)

$w_t = w_{t-1} - \eta\, \dfrac{\partial E}{\partial w}\bigg|_{w_{t-1}}$

With $\eta = \tfrac{1}{2}$ the minimum is reached in one step.


    Paul Viola 1999 Machine Learning

Scale the Input

- Simplest case: 1 weight, quadratic error function.

[Figure: the same single-weight network, but with the input scaled by 2.]

$E(w) = (2w - 0)^2 = 4w^2$

$w_t = w_{t-1} - \eta\, \dfrac{\partial E}{\partial w}\bigg|_{w_{t-1}}$

Now the one-step choice is $\eta = \tfrac{1}{8}$.

Hack 1: Start eta very small.

Increase it if the error decreases. More hacks coming!

    Paul Viola 1999 Machine Learning

    Multiple Weights

    0.020

    0.047

    0.049

    0.050

    Hack 2: Momentum

  • 8/9/2019 Machine Learning Course

    112/326

    8

    Paul Viola 1999 Machine Learning

    Momentum

$\Delta \mathbf{w}_t = -\eta\, \dfrac{\partial E}{\partial \mathbf{w}} + \alpha\, \Delta \mathbf{w}_{t-1},
\qquad
\mathbf{w}_t = \mathbf{w}_{t-1} + \Delta \mathbf{w}_t$
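A minimal Matlab sketch of momentum on the 1-D quadratic from the previous slide (the step size and momentum constant are made-up):

    % Gradient descent with momentum on E(w) = (2w)^2 = 4 w^2
    eta = 0.05;  alpha = 0.9;
    w = 1;  dw = 0;
    for t = 1:50
        grad = 8 * w;                   % dE/dw for E(w) = 4 w^2
        dw = -eta * grad + alpha * dw;  % momentum: reuse a fraction of the last step
        w = w + dw;
    end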

    Paul Viola 1999 Machine Learning

- Gradient descent assumes a locally linear cost function:

$E(w + \Delta) \approx E(w) + \Delta\, E'(w),
\qquad
w_t = w_{t-1} - \eta\, E'(w_{t-1})$

- Second-order techniques assume a locally quadratic one:

Second Order Techniques

$E(w + \Delta) \approx E(w) + \Delta\, E'(w) + \tfrac{1}{2}\Delta^2\, E''(w)$

For $E(w) = a w^2 + b w + c$:  $\dfrac{\partial E}{\partial w} = 2aw + b$, $\dfrac{\partial^2 E}{\partial w^2} = 2a$, and the second-order step $\Delta w = -\dfrac{E'(w)}{E''(w)}$ gives

$w_1 = w_0 - \dfrac{2a w_0 + b}{2a} = -\dfrac{b}{2a}$,

the minimum, in a single step.


    Paul Viola 1999 Machine Learning

    No Hands Across America

    Paul Viola 1999 Machine Learning

    Zip Codes

    Le Cun

  • 8/9/2019 Machine Learning Course

    116/326

    1

    Paul Viola 1999 Machine Learning

6.891 Machine Learning and Neural Networks

Lecture 9:

On to Support Vector Techniques

Paul Viola 1999 Machine Learning

News

- Final will be 12/13 at 1:30 PM

  If you have a conflicting final let us know.

- Remember that almost all the material appears in the book. Right now we are jumping back and forth between

  Chapter 5

  Chapter 6


    Paul Viola 1999 Machine Learning

Why did we need multi-layer perceptrons?

- Problems like this seem to require very complex non-linearities.

- Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.

    Paul Viola 1999 Machine Learning

    Why an exponential number of features?

$\phi(x) = \left(x_1^5,\; x_1^4 x_2,\; x_1^3 x_2^2,\; x_1^2 x_2^3,\; x_1 x_2^4,\; x_2^5,\; x_1^4,\; x_1^3 x_2,\; x_1^2 x_2^2,\; \ldots\right)$

14th order???  120 features.

$\dbinom{n + k}{k} = \dfrac{(n+k)!}{n!\,k!} = O\left(n^{\min(k, n)}\right)$,  where $k$ = polynomial order and $n$ = number of variables.

N=21, k=5 --> 65,000 features
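The counts quoted on the slide can be checked directly with the binomial-coefficient formula; nchoosek is a Matlab built-in:

    nchoosek(21 + 5, 5)   % = 65780, roughly the 65,000 features quoted for n = 21, k = 5
    nchoosek(14 + 2, 2)   % = 120, the feature count for 2 variables at 14th order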


    Paul Viola 1999 Machine Learning

    MLPs vs. Perceptron

    l MLPs are incredibly hard to train Takes a long time (unpredictably long)

    Can converge to poor minima

    l MLP are hard to understand What are they really doing?

    l Perceptrons are easy to train Type of linear programming. Polynomial time.

    One minimum which is global.

    l Generalized perceptrons are easier to understand. Polynomial functions.

    Paul Viola 1999 Machine Learning

    Perceptron Training

    is Linear Programming

    lwi

    l

    ii > 0x After Normalization

    After adding bias

    Assumes no errors

    Polynomial time in the number of variables

    and in the number of constraints.

    lsw li

    l

    ii >+ 0x

    lsl > 0l

    lsmin

    What about linearly inseparable?

  • 8/9/2019 Machine Learning Course

    120/326

    5

    Paul Viola 1999 Machine Learning

    Rebirth of Perceptrons

    l How to train efficiently. Linear Programming ( later quadratic programming)

    l How to get so many features inexpensively?!?

    l How to generalize with so many features? Occams revenge.

    Support Vector Machines

    Paul Viola 1999 Machine Learning

    Lemma 1: Weight vectors are simple

    l The weight vector lives in a sub-space spanned bythe examples Dimensionality is determined by the number of examples

    not the complexity of the space.

    t

    tw x=00=w

    ==l

    l

    l

    errors

    t

    t bw xx =l

    l

    lt bw )(x

  • 8/9/2019 Machine Learning Course

    121/326

    6

    Paul Viola 1999 Machine Learning

    Lemma 2: Only need to compare examples

    Paul Viola 1999 Machine Learning

    Perceptron Rebirth: Generalizationl Too many features Occam is unhappy

    Perhaps we should encourage smoothness?

    lsKb lj

    j

    lj >+ 0),( xx

    lsl > 0

    l

    lsmin

    j

    jb2min

    Smoother

    But this is unstable!!

  • 8/9/2019 Machine Learning Course

    122/326

    7

    Paul Viola 1999 Machine Learning

    The linear could return any multiple of the correct

    weight vector...

    lwi

    l

    ii > 0 x ( ) lwi

    l

    ii > 0 x

    Linear Program is not unique

    lswl

    i

    l

    ii >+0x

    lsl > 0l

    l

    smin

    i

    iw2

    min

    Slack variables & Weight prior

    - Force the solution toward zero

    Paul Viola 1999 Machine Learning

    Definition of the Margin

    l

    Margin: Gap between negatives and positivesmeasured perpendicular to a hyperplane

  • 8/9/2019 Machine Learning Course

    123/326

  • 8/9/2019 Machine Learning Course

    124/326

  • 8/9/2019 Machine Learning Course

    125/326

    1

    Paul Viola 1999 Machine Learning

    SVM: examples

    Paul Viola 1999 Machine Learning

    SVM: Key Ideasl Augment inputs with a very large feature set

    Polynomials, etc.

    l Use Kernel Trick(TM) to do this efficiently

    l Enforce/Encourage Smoothness with weight penalty

    Minimize

    l Introduce Margin so that:

    Set of linear inequalities

    l Find best solution using Quadratic Programming

    =i

    i

    T www 2

    iwi 0

    )(x

  • 8/9/2019 Machine Learning Course

    126/326

    1

    Paul Viola 1999 Machine Learning

    SVM: Difficulties

    l How do you pick the kernels? Intuition / Luck / Magic /

    l What if the data is not linearly separable? i.e. the constraints cannot be satisfied

    j

    i

    ij

    i

    jcxKwtj

    1),()12(

    + j jT cww ||min Slack Variables

    Paul Viola 1999 Machine Learning

    SVM: Simple Example

    l

    Data dimension: 2l Feature Space: 2nd order polynomial

    4 dimensional

    6 weights

  • 8/9/2019 Machine Learning Course

    127/326

  • 8/9/2019 Machine Learning Course

    128/326

    1

    Paul Viola 1999 Machine Learning

    Zip Codes

    Much Effort spent on

    organizing the network

    Paul Viola 1999 Machine Learning

SVM: Zip Code Recognition

- Data dimension: 256

- Feature space: 4th order, roughly 100,000,000 dimensions

  • 8/9/2019 Machine Learning Course

    129/326

    1

    Paul Viola 1999 Machine Learning

    SVM: Faces

    Paul Viola 1999 Machine Learning

    Support Vectors

  • 8/9/2019 Machine Learning Course

    130/326

    1

    Paul Viola 1999 Machine Learning

6.891 Machine Learning and Neural Networks

Lecture 10:

Support Vector Machines

More Details and Derivations

Paul Viola 1999 Machine Learning

News

- Quiz is 1 week from today.

- Problem set 4 will go out right after the quiz. In one week it's a pain to do two things at once.

- Problem sets are very good (once again). On an absolute scale many of you are getting A's.


    Paul Viola 1999 Machine Learning

    Pset 2

    Paul Viola 1999 Machine Learning

    Review & Overviewl Lecture 9:

    Resurrecting Perceptrons

    Setting up Support Vector Machines

    l SVM review

    l Why is it called Support Vectors??

    l Derivation of some simpler properties.

  • 8/9/2019 Machine Learning Course

    132/326

    3

    Paul Viola 1999 Machine Learning

    SVM: Key Ideas

    l Augment inputs with a very large feature set Polynomials, etc.

    l Use Kernel Trick(TM) to do this efficiently

    l Enforce/Encourage Smoothness with weight penalty

    Minimize

    l Find best solution using Quadratic Programming

    =i

    i

    T bbb 2 ibi 0

    )(x

    Avoid!

    Paul Viola 1999 Machine Learning

    Support Vectors

    l Many of the bs are zero -- inactive constraints Only keep examples where

    l Likely to generalize well VC Dimension -- later in the semester

    1),(

    i

    ij

    i cxKbj

    constraintsubject to)min( wwT

    =

    i

    i

    i cxKbxy ),()(

    0ib

  • 8/9/2019 Machine Learning Course

    133/326

  • 8/9/2019 Machine Learning Course

    134/326

    5

    Paul Viola 1999 Machine Learning

    The optimal dividing line

    l The optimal separatormaximizes the marginbetweenpositive and negative examples

    iT

    negativesxwd max=

    iT

    positivesxwd min=+

    ( )||

    maxmarginmaxw

    dd

    ww

    +=

    ||margin

    w

    dd +=

    Paul Viola 1999 Machine Learning

    Definition of the Margin

    l

    Margin: Gap between negatives and positivesmeasured perpendicular to a hyperplane

  • 8/9/2019 Machine Learning Course

    135/326

    6

    Paul Viola 1999 Machine Learning

    Optimal dividing line=Support Vectors

    iT

    negativesxwd max=

    iT

    positivesxwd min=+

    ||max

    w

    dd

    w

    +

    1 iTnegatives

    xw

    1 iTpositives

    xw

    wwT

    min

    Paul Viola 1999 Machine Learning

    Lemma 1: Weight vectors are simple

    l The weight vector lives in a sub-space spanned bythe examples

    l Proved this to you by analyzing the perceptronweight update rule But we no longer use that rule!!!

    Instead we use Quadratic Programming

    =l

    l

    lbw x =l

    l

    lbw )(x

  • 8/9/2019 Machine Learning Course

    136/326

    7

    Paul Viola 1999 Machine Learning

    Lemma 1: Kuhn-Tucker Conditions

    1

    1

    1

    3

    2

    1

    xw

    xw

    xw

    T

    T

    T

    ( )wwTmin

    11

    2

    1

    ==

    xw

    xw

    T

    T

    2

    2

    1

    1 xxw bb +=

    Some of the examples do not contribute

    the inactive constraints.

    Paul Viola 1999 Machine Learning

    SVM versus Perceptronl Why not just use a perceptron?

    Use all training points as a centers

    Update using perceptron rule:

    l

    Perceptrons do not maximize the margin The estimated function is not terribly smooth

    l Perceptrons do not rely on very few support vectors Yields a much more efficient classifier.

    =

    i

    iT

    i cxKwxy ),()(

    ),())((i

    ii cxKxytww +=

  • 8/9/2019 Machine Learning Course

    137/326

  • 8/9/2019 Machine Learning Course

    138/326

    9

    Paul Viola 1999 Machine Learning

    Support Vectors

    Paul Viola 1999 Machine Learning

    SVM: Difficultiesl How do you pick the kernels?

    Intuition / Luck / Magic /

    l What if the data is not linearly separable? i.e. the constraints cannot be satisfied

    1),( +

    j

    i

    ij

    i scxKbj

    + j jT scbb ||min Slack Variables

  • 8/9/2019 Machine Learning Course

    139/326

  • 8/9/2019 Machine Learning Course

    140/326

  • 8/9/2019 Machine Learning Course

    141/326

  • 8/9/2019 Machine Learning Course

    142/326

  • 8/9/2019 Machine Learning Course

    143/326

    3

    Paul Viola 1999 Machine Learning

    Optimal dividing line=Support Vectors

    iT

    negativesxwd max=

    iT

    positivesxwd min=+

    ||max

    w

    dd

    w

    +

    1 iTnegatives

    xw

    1 iTpositives

    xw

    wwT

    1max

    ||

    1

    wd

    = ||

    1

    wd =+

    wwww

    ww

    w

    ddT

    1

    ||

    1

    ||

    ||

    1

    ||

    1

    || 2==

    =

    +

    wwT

    min

    Paul Viola 1999 Machine Learning

    Kernel Networks are Good for Regression

    l The form of the Kernel determines the form ofthe final function Polynomial Kernels -> Polynomial Function

    Gaussian Kernels -> Sum of Gaussians

    l The Common Error Criteria is squared error

    =

    i

    i

    i cxKbxy ),()( =i

    i

    icxKbxy ),()(

    ( ) =j

    jj txyError2

    )(

    This ends up being exactly like polynomial fitting

    except that there is one weight per data point

    This ends up being exactly like polynomial fitting

    except that there is one weight per data point

  • 8/9/2019 Machine Learning Course

    144/326

    4

    Paul Viola 1999 Machine Learning

    Radial Basis Function Networks

    l When we restrict ourselves to Kernels which areradially symmetric, the resulting network is calleda Radial Basis Function Network K only depends on the radial distance from some

    datapoint c.

    Poggio & Girosi pioneered the use of these.

    |)(|),( cxKcxK = =i

    i

    i cxKbxy |)(|)(

    Paul Viola 1999 Machine Learning

From Smoothness to Kernels:

Assumptions are Necessary

[Figure: two sparse 1-D datasets on [0, 10]; question marks indicate the values to be predicted between the observations.]

Intuition


    Paul Viola 1999 Machine Learning

    Setting up the problem

$W$ : an observation matrix with one row per measured point; each row is all zeros except a single 1 in the column of the observed sample (here points 1, 5, and 10 of a length-10 vector).

$T = (1,\; 3,\; 1)^T$ : the observed values.

$\mathrm{Cost} = (WY - T)^2$

$Y = (W^T W)^{-1} W^T T$ ... but $W^T W$ is not invertible!

    Paul Viola 1999 Machine Learning

    Conditioning the Problem

$Y = (W^T W + \lambda I)^{-1} W^T T$

$\mathrm{Cost} = (WY - T)^2 + \lambda\, Y^T Y, \qquad Y^T Y = \sum_i y_i^2$

Small solution vectors are best.

$Y = (1,\ 0,\ 0,\ 0,\ 3,\ 0,\ 0,\ 0,\ 0,\ 1), \qquad T = (1,\ 0,\ 0,\ 0,\ 3,\ 0,\ 0,\ 0,\ 0,\ 1)$

[Figure: the regularized solution plotted over the 10 positions.]
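A minimal Matlab sketch of this regularized solution (the observation pattern and lambda are made-up to mirror the slide):

    % Observations at positions 1, 5, and 10 of a length-10 vector
    W = zeros(3, 10);  W(1,1) = 1;  W(2,5) = 1;  W(3,10) = 1;
    T = [1; 3; 1];
    lambda = 0.1;
    Y = (W' * W + lambda * eye(10)) \ (W' * T);   % regularized least squares
    plot(1:10, Y, 'o-')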


    Paul Viola 1999 Machine Learning

    and the winner is?

    2 4 6 8 100

    1

    2

    3

    4

    5

    6

    2 4 6 8 100

    0.5

    1

    1.5

    2

    2.5

    3

    2 4 6 8 100

    0.5

    1

    1.5

    2

    2.5

    3

    This is not always true

    remember to think like a

    Bayesian

    Paul Viola 1999 Machine Learning

    Smooth is Good: Regularization

    l Alternative way to motivate Kernel Networks.

    ( )

    Cost y Error y Smoothness y

    y x tx

    y x dxj j

    j

    ( ) ( ) ( )

    ( )$

    ( $) $

    = +

    = +

    2

    2

  • 8/9/2019 Machine Learning Course

    147/326

  • 8/9/2019 Machine Learning Course

    148/326

    8

    Paul Viola 1999 Machine Learning

    Need to find lambda

    Y=1.0000

    1.5000 2.0000 2.5000 3.0000 2.6000 2.2000 1.8000 1.4000 1.0000

    001.0=

    2 4 6 8 100

    0.5

    1

    1.5

    2

    2.5

    3

    10=

    Y = 1.6000 1.6600

    1.7200 1.7800 1.8400 1.7840 1.7280 1.6720 1.6160 1.5600

    0.3

    2 4 6 8 100

    0.5

    1

    1.5

    2

    2.5

    3

    1.8

    Paul Viola 1999 Machine Learning

    Derivative Order Controls Shape

    1 0 2 0 30 40 500

    0. 5

    1

    1. 5

    2

    2. 5

    3

    1 0 2 0 30 40 500

    0. 5

    1

    1. 5

    2

    2. 5

    3

  • 8/9/2019 Machine Learning Course

    149/326

  • 8/9/2019 Machine Learning Course

    150/326

    10

    Paul Viola 1999 Machine Learning

    Still Piecewise Cubic

    10 20 30 40 50

    -1

    -0.8

    -0.6

    -0.4

    -0.2

    0

    0.2

    0.4

    0.6

    0.8

    10 20 30 40 50

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    Paul Viola 1999 Machine Learning

    Smoothness is easily controlled

[Figure: six reconstructions of the same data with increasing smoothness penalties; smoothness is easily controlled.]


    Paul Viola 1999 Machine Learning

    Regularization to RBFs

    l Alternative way to motivate RBFs

$E(y) = \mathrm{Error}(y) + \lambda\, \mathrm{Smoothness}(y)$

The resulting solution is a sum of Gaussian bumps, one at every training point.

    Paul Viola 1999 Machine Learning

Problem: Too many centers

(Old solutions)

- One per data point can be way too many.

- Choose a random subset of the points: hope you don't get unlucky.

- Distribute them based on the density of points: perhaps EM clustering.


    Paul Viola 1999 Machine Learning

    Too Many Centers 2

    l Put them where you need them To best approximate your function

    l Compute the derivative of E(y) w.r.t. the centers This gets very hairy and does not work well

    Too many local minima -- no small weight trick

    Paul Viola 1999 Machine Learning

    Too Many Centers 3

    l Support Vector Regression Next time.


    Paul Viola 1999 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    Lecture 12:

    Smooth Functions and Kernel Networks

    Paul Viola 1999 Machine Learning

News
    l Quiz was too hard

    I am trying to come up with a creative grading scheme.

    Best 5 out of 6 problems???

    First let us do the grading.

    l Problem set will be out by tonight.


    Paul Viola 1999 Machine Learning

    Review & Overview

    l Lecture 11: Trying to find smooth functions.

    l Smooth Regression Another way of motivating Kernel networks

    Paul Viola 1999 Machine Learning

    where we were last time.

[Plots: three candidate fits and the squared 1st derivative of each, summed over the grid:  Sum = 49.7,  Sum = 20,  Sum = 1.8]


    Paul Viola 1999 Machine Learning

    Regression Review

l Up until now we have been mostly analyzing classification:

    X, inputs. Y, classes. Find the best c(x).

    l Today: Regression.

    X, inputs. Y, outputs. Find the best f(x).

    Predict the stock's value next week.

    Picture of Road -> Car steering wheel

    etc.

min_w Σ_j ( f(x_j, w) - y_j )^2

    Paul Viola 1999 Machine Learning

Schemes for motivating regression
    l Prior assumptions

Find the best polynomial which fits the data:

        f(x, w) = w_0 + w_1 x + w_2 x^2 + …            min_w Σ_j ( f(x_j, w) - y_j )^2

    Or find the best neural network, or ???

    l Bayesian Approach: find the most likely function:

        max_f p( f | {x_j, y_j} ) = max_f  p( {x_j, y_j} | f ) p(f) / p( {x_j, y_j} )
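For the "prior assumptions" scheme above, a minimal NumPy sketch (the toy data is an assumption, purely for illustration) of fitting the best cubic under squared loss:

    import numpy as np

    x = np.linspace(0, 1, 10)                                # toy inputs
    y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)    # toy noisy outputs

    w = np.polyfit(x, y, deg=3)   # minimizes sum_j (f(x_j, w) - y_j)^2 over cubic polynomials
    f = np.poly1d(w)
    print(f(0.5))                 # prediction at a new input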


    Paul Viola 1999 Machine Learning

    Bayesian framework captures

    many approaches

max_f p( f | {x_j, y_j} ) = max_f  p( {x_j, y_j} | f ) p(f) / p( {x_j, y_j} )

    max_f  [ log p( {x_j, y_j} | f ) + log p(f) - log p( {x_j, y_j} ) ]

    log p( {x_j, y_j} | f ) = Σ_j log p( x_j, y_j | f )
                            = Σ_j log G( f(x_j) - y_j )
                            = c - Σ_j ( f(x_j) - y_j )^2

    p(f) = { constant   if f is a polynomial
           { 0          otherwise

    The polynomial that fits the data best is the most likely function.
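Spelling out the Gaussian-noise step above (a standard identity; σ is the assumed noise scale):

    log p({x_j, y_j} | f) = Σ_j log [ (1/√(2π)σ) exp( -( f(x_j) - y_j )^2 / 2σ^2 ) ]
                          = const - (1/2σ^2) Σ_j ( f(x_j) - y_j )^2

so maximizing the likelihood under Gaussian noise is exactly minimizing the squared error.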

    Paul Viola 1999 Machine Learning

    Bayesian framework captures

    many approaches

log p( {x_j, y_j} | f ) = c - Σ_j ( f(x_j) - y_j )^2

    log p(f) = -λ ∫ ( ∂f/∂x )^2 dx

    The function which both fits the data and has a small derivative is the most likely.

    log p(f) = -λ ∫ ( ∂^2 f/∂x^2 )^2 dx        Also popular


    Paul Viola 1999 Machine Learning

    A closer look...

C(f) = c Σ_j ( f(x_j) - y_j )^2  +  λ ∫ ( ∂f/∂x )^2 dx

    Data:    X = [1  5  10]^T        Y = [1  3  1]^T

    l How do we minimize this function? The set of possible functions is infinite

    The space of functions is infinite dimensional

    l Constrain f to be: polynomial, sum of exponentials, etc??

    l What about unconstrained solutions...

    Paul Viola 1999 Machine Learning

A slightly simpler problem

    l But f is not a scalar!!

    l In fact f is more like an infinite dimensional vector.

    l If f were a finite vector:

        dC(f)/df = 0            ∂C(f)/∂f_j = 0   for every j

        Cannot reduce C() by adjusting any of the parameters of f.

        C(f) = c Σ_j ( f(x_j) - y_j )^2 + λ ∫ ( ∂f/∂x )^2 dx


    Paul Viola 1999 Machine Learning

    We could approximate f .

[Plots: the function f approximated on grids of 10, 20, 100, and 1000 points]

    Paul Viola 1999 Machine Learning

Looking for smooth solutions...

    [Plots: candidate fits to the same data]

    Intuition    ? ?


    Paul Viola 1999 Machine Learning

    Setting up the problem

W = 10x10 mask matrix: a 1 on the diagonal at each observed grid point (x = 1, 5, 10), zeros everywhere else

    Y = [1 0 0 0 3 0 0 0 0 1]^T        (targets, zero where nothing was observed)

    F = (W^T W)^{-1} W^T Y

    Cost = (W F - Y)^2

    Not Invertible!

    F = [1 22 0 0 3 0 0 6 0 1]         (off the data points the values are arbitrary)

    Paul Viola 1999 Machine Learning

    Conditioning the Problem

F = (W^T W + λI)^{-1} W^T Y

    Cost = (W F - Y)^2 + λ F^T F        F^T F = Σ_i f_i^2

    Small solution vectors are best.

    F = [1 0 0 0 3 0 0 0 0 1]
    Y = [1 0 0 0 3 0 0 0 0 1]


    Paul Viola 1999 Machine Learning

    and the winner is?


This is not always true; remember to think like a Bayesian.

    Paul Viola 1999 Machine Learning

    Smooth is Good: Regularization

    l Alternative way to motivate Kernel Networks.

Cost(f) = Error(f) + λ Smoothness(f)
            = Σ_j ( f(x_j) - y_j )^2  +  λ ∫ ( ∂f/∂x )^2 dx
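On a grid the smoothness integral turns into a sum of squared first differences, which is exactly what the D matrix on the next slide implements (the grid spacing is absorbed into λ):

    ∫ ( ∂f/∂x )^2 dx  ≈  Σ_i ( f_{i+1} - f_i )^2  =  (DF)^T (DF)

where each row of D holds a 1 and a -1 on adjacent grid points.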


    Paul Viola 1999 Machine Learning

    Setting up the Problem

W = 10x10 mask matrix: a 1 on the diagonal at each observed grid point (x = 1, 5, 10), zeros everywhere else

    Y = [1 0 0 0 3 0 0 0 0 1]^T

    D = 9x10 first-difference matrix:
        [ 1 -1  0  0  0  0  0  0  0  0
          0  1 -1  0  0  0  0  0  0  0
          ...
          0  0  0  0  0  0  0  0  1 -1 ]

    F = (W^T W + λ D^T D)^{-1} W^T Y

    Cost = (W F - Y)^2 + λ (D F)^2

    F = [1.0000 1.5000 2.0000 2.5000 3.0000 2.6000 2.2000 1.8000 1.4000 1.0000]
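A minimal NumPy sketch of this regularized solve (assuming, as on the earlier slides, a 10-point grid with data at x = 1, 5, 10); with a small λ it reproduces the piecewise-linear F shown above:

    import numpy as np

    n = 10
    obs, t = [0, 4, 9], [1.0, 3.0, 1.0]

    W = np.zeros((n, n)); W[obs, obs] = 1.0          # mask matrix
    Y = np.zeros(n);      Y[obs] = t
    D = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)     # first-difference operator

    lam = 1e-3
    F = np.linalg.solve(W.T @ W + lam * D.T @ D, W.T @ Y)
    print(np.round(F, 3))    # ~[1.0 1.5 2.0 2.5 3.0 2.6 2.2 1.8 1.4 1.0]

Raising lam trades data fit for smoothness, which is exactly the choice the next slide ("Need to find lambda") illustrates.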

    Paul Viola 1999 Machine Learning

Need to find lambda

    λ = 0.001:
        F = [1.0000 1.5000 2.0000 2.5000 3.0000 2.6000 2.2000 1.8000 1.4000 1.0000]

    λ = 10:
        F = [1.6000 1.6600 1.7200 1.7800 1.8400 1.7840 1.7280 1.6720 1.6160 1.5600]


    Paul Viola 1999 Machine Learning

    Derivative Order Controls Shape

[Plots: the fit under a squared 1st-derivative penalty ( ∂f/∂x )^2 vs. a squared 2nd-derivative penalty ( ∂^2 f/∂x^2 )^2]

    Paul Viola 1999 Machine Learning

    A Closer Look

[Plots]    Linear + Kinks        Cubics + Kinks


    Paul Viola 1999 Machine Learning

    Look at the regularizer...

D = 9x10 first-difference matrix (each row: a 1 and a -1 on adjacent grid points)

    F = (W^T W + λ D^T D)^{-1} W^T Y        Cost = (W F - Y)^2 + λ (D F)^2

    D' * D =
         1 -1  0  0  0  0  0  0  0  0
        -1  2 -1  0  0  0  0  0  0  0
         0 -1  2 -1  0  0  0  0  0  0
         0  0 -1  2 -1  0  0  0  0  0
         0  0  0 -1  2 -1  0  0  0  0
         0  0  0  0 -1  2 -1  0  0  0
         0  0  0  0  0 -1  2 -1  0  0
         0  0  0  0  0  0 -1  2 -1  0
         0  0  0  0  0  0  0 -1  2 -1
         0  0  0  0  0  0  0  0 -1  1

    Paul Viola 1999 Machine Learning

    Second Deriv -> Fourth Deriv

    D =

    1 -1 0 0 0 0 0 0 0 0

    -1 2 -1 0 0 0 0 0 0 0

    0 -1 2 -1 0 0 0 0 0 0

    0 0 -1 2 -1 0 0 0 0 0

    0 0 0 -1 2 -1 0 0 0 0

    0 0 0 0 -1 2 -1 0 0 0

    0 0 0 0 0 -1 2 -1 0 0

    0 0 0 0 0 0 -1 2 -1 0

    0 0 0 0 0 0 0 -1 2 -1

    0 0 0 0 0 0 0 0 -1 1

    D' * D =

    1 -2 1 0 0 0 0 0 0 0

    -2 5 -4 1 0 0 0 0 0 0

    1 -4 6 -4 1 0 0 0 0 0

    0 1 -4 6 -4 1 0 0 0 0

    0 0 1 -4 6 -4 1 0 0 0

    0 0 0 1 -4 6 -4 1 0 0

    0 0 0 0 1 -4 6 -4 1 0

    0 0 0 0 0 1 -4 6 -4 1

    0 0 0 0 0 0 1 -4 5 -2

    0 0 0 0 0 0 0 1 -2 1
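A quick NumPy check of this composition (assuming the 10-point grid): the first print reproduces the tridiagonal matrix labelled D above, the second the pentadiagonal D' * D:

    import numpy as np

    n = 10
    D1 = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)    # first differences
    D2 = D1[:-1, :-1] @ D1                            # second differences: rows of 1 -2 1

    print((D1.T @ D1).astype(int))    # tridiagonal   ... -1  2 -1 ...
    print((D2.T @ D2).astype(int))    # pentadiagonal ...  1 -4  6 -4  1 ...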


    Paul Viola 1999 Machine Learning

    Still Piecewise Cubic


    Paul Viola 1999 Machine Learning

    Smoothness is easily controlled


    Paul Viola 1999 Machine Learning

    Regularization to RBFs

    l Alternative way to motivate RBFs

E(y) = Error(y) + Smoothness(y)

    A Gaussian at every training point.
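A minimal sketch of the resulting kernel network, assuming a Gaussian kernel with one center per training point and a tiny ridge term for numerical stability (the toy data and the width sigma = 2 are assumptions):

    import numpy as np

    def gauss(a, b, sigma=2.0):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

    x = np.array([1.0, 5.0, 10.0])    # training inputs
    y = np.array([1.0, 3.0, 1.0])     # training targets

    K = gauss(x, x)                                        # kernel matrix
    b = np.linalg.solve(K + 1e-6 * np.eye(len(x)), y)      # one weight per training point

    x_new = np.linspace(0, 11, 5)
    print(np.round(gauss(x_new, x) @ b, 2))                # f(x) = sum_j b_j K(x, x_j)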

    Paul Viola 1999 Machine Learning

Problem: Too many centers (old solutions)

    l One per data point can be way too many

    l Choose a random subset of the points. Hope you don't get unlucky.

    l Distribute them based on the density of points. Perhaps EM clustering.


    Paul Viola 1999 Machine Learning

    Too Many Centers 2

    l Put them where you need them To best approximate your function

    l Compute the derivative of E(y) w.r.t. the centers This gets very hairy and does not work well

    Too many local minima -- no small weight trick

    Paul Viola 1999 Machine Learning

Too Many Centers 3
    l Support Vector Regression

    Cost(f) = Error(f) + Smoothness(f)
            = Σ_j ( f(x_j) - y_j )^2  +  λ ∫ ( ∂f/∂x )^2 dx

    f(x) = Σ_j w_j K(x, x_j)

    Many w_j's are zero!!!


    Paul Viola 1999 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    Lecture 13:

    Kernel Networks

    on to Unsupervised Learning

    Paul Viola 1999 Machine Learning

News
    l Quizzes are graded

    Each problem has been graded.

    ** The overall score for the quiz is being determined. We ran out of time last night.

    l Course grading: (approximate) Psets: 35%

    Quiz: 20%

    Final: 30%

    Project: 10%

    Participation: 5%


    Paul Viola 1999 Machine Learning

    Pset 3

    You are doing spectacularly well

    Paul Viola 1999 Machine Learning

    Exams


    Paul Viola 1999 Machine Learning

    Grading alternatives

    Paul Viola 1999 Machine Learning

Review & Overview
    l Lecture 12:

    Trying to find smooth functions.

    Requiring smoothness simplifies functions: 1st deriv -> piecewise linear; 2nd deriv -> cubic

    l Finish off Regression

    l Begin unsupervised learning. PCA, etc


    Paul Viola 1999 Machine Learning

    Cubics are similar...

C(f) = Σ_j ( f(x_j) - y_j )^2 + λ ∫ ( ∂^2 f/∂x^2 )^2 dx

    f(x) = Σ_j b_j K(x, x_j)        with    K(x, x_j) = | x - x_j | (x - x_j)^2

    ∂^4 K(x, x_j) / ∂x^4 = δ( x - x_j )

    ∂^4 f(x) / ∂x^4 = Σ_j a_j δ( x - x_j ) = Σ_j b_j ∂^4 K(x, x_j) / ∂x^4

    Paul Viola 1999 Machine Learning

    Can also get gaussian kernels

    l if you want them!!

E(y) = Error(y) + Smoothness(y)

    A Gaussian at every training point.


    Paul Viola 1999 Machine Learning

    Smoothness is easily controlled


    Paul Viola 1999 Machine Learning

Problem: Too many centers (old solutions)

    l One per data point can be way too many

    l Choose a random subset of the points. Hope you don't get unlucky.

    l Distribute them based on the density of points. Perhaps EM clustering.


    Paul Viola 1999 Machine Learning

    Too Many Centers 2

    l Put them where you need them To best approximate your function

    l Compute the derivative of C(f) w.r.t. the centers This gets very hairy and does not work well

    Too many local minima -- no small weight trick

    Paul Viola 1999 Machine Learning

Support Vector Regression

f(x) = w^T x + b

    min_w  w^T w      subject to      ( w^T x_j + b ) - y_j ≤ ε
                                       y_j - ( w^T x_j + b ) ≤ ε
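These slides predate it, but a convenient way to try this formulation today is scikit-learn's SVR (shown purely as an illustration, not as the course's code); with a kernel it yields the sparse expansion f(x) = Σ_j w_j K(x, x_j) promised on the previous slide:

    import numpy as np
    from sklearn.svm import SVR

    x = np.linspace(0, 10, 50).reshape(-1, 1)           # toy data, assumed
    y = np.sin(x).ravel() + 0.1 * np.random.randn(50)

    model = SVR(kernel='rbf', C=10.0, epsilon=0.1)       # epsilon: width of the insensitive tube
    model.fit(x, y)
    print(len(model.support_), "support vectors out of", len(x))
    print(model.predict([[5.0]]))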


    Paul Viola 1999 Machine Learning

    New Topic: Unsupervised Learning

l What can you do to understand data when you have no labels? Find unusual structure in the data.

    Find simplifications of the data.

    l Find the clusters in the data: Fit a mixture of gaussians

    Been there, done that.

    l Reduce the dimensionality of the data: Find a linear projection from high to low dimensions

l These all amount to density estimation
    l There are many other approaches

    Build a tree which captures the data, etc.

    Paul Viola 1999 Machine Learning

    8 bits per pixel .36 bits per pixel


    Paul Viola 1999 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    Lecture 14:

    on to Unsupervised Learning

    Paul Viola 1999 Machine Learning

News
    l I will try to give you a feeling for where we are

    headed: Next 4 lectures

Bayes Nets / Graphical Models / Boltzmann Machines / HMMs

    After that, a series of topics (from papers).


    Paul Viola 1999 Machine Learning

    Review & Overview

    l Lecture 13: The end of regression

    l Begin unsupervised learning. PCA, etc

    Paul Viola 1999 Machine Learning

New Topic: Unsupervised Learning
    l What can you do to understand data when you

    have no labels? Find unusual structure in the data.

    Find simplifications of the data.

    l Find the clusters in the data: Fit a mixture of gaussians

    Been there, done that.

    l Reduce the dimensionality of the data:

Find a linear projection from high to low dimensions
    l These all amount to density estimation

    l There are many other approaches Build a tree which captures the data, etc.


    Paul Viola 1999 Machine Learning

    Exploratory Data Analysis

    l Machine learning is simply not that smart

    l It is still very important to look at the data.

l But when there are millions of examples and thousands of dimensions you cannot look at the data.

    l It is very important to summarize the data Summarize the statistics

    Reduce dimensionality

    Paul Viola 1999 Machine Learning

Example: Mixture of Gaussians
    l Start out with many training points (1000s)

    Distributed in a clumpy fashion.

    l Replace with a few summary clusters (10s) One for each clump.

    l Why? Better understand/summarize the data

Demographics Data: growing number of stay at home dads
    l Salary < 1000; Children > 0; Family Salary > 20,000, etc.

    Astronomical Data: Unusual distribution of stars near galaxy

    Extract symbols
    l one cluster per word (in speech), one cluster per letter (in writing)

    Speed learning: Learn with the clusters rather than all the data (approximate)


    Paul Viola 1999 Machine Learning

    Example of Clustering

    l Can I get some examples of clustering???

    l PDP??

    l Andrew Moore

    Paul Viola 1999 Machine Learning

Speed Learning???

    l Regression & Kernel Networks

        f(x) = Σ_j b_j K(x, x_j)

    Learning time is cubic in the number of examples.

    Need to solve the linear system to find weights.

    Performance time is linear in the number of examples.

    10,000 examples -> 100 clusters??

    l Problem: these clusters are not optimal

    Other locations may be much more effective for approximating the target function.

    Clusters congregate near data, not at the margin!


    Paul Viola 1999 Machine Learning

    Support Vector Machines

    l Claims to choose the optimal set of examples 10,000 -> 100

    Performance is optimal

l But, training time is still cubic in the number of examples.

    l Though some new algorithms are beginning to work more quickly.

        f(x) = Σ_j b_j K(x, x_j)

    Paul Viola 1999 Machine Learning

Example: Dimensionality Reduction
    l It is difficult to visualize 10, 20, 1000 dimensions

    l Worse, there is reason to believe that it can be very difficult to learn in high dimensions.

    l Curse of dimensionality: As the dimensionality of the data grows, the average
    distance between points also grows.

    As a result it is difficult to get an accurate local estimate for what is going on.


    Paul Viola 1999 Machine Learning

    Curse of Dimensionality of Nearest

    Neighbor

    l How far is it to your nearest neighbor?? Easier Question: How far do you have to look before

    expecting 1 neighbor.

    Assume points are distributed in a sphere of unit radius

The point in the center of the sphere should have the largest number of neighbors. How far do we have to go
    to find 1 neighbor on average??

        Vol( S_k(r) ) = c_k r^k
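A small worked check (a sketch, assuming n points uniform in a unit k-dimensional ball; the radius expected to contain one neighbor then satisfies r^k = 1/n, i.e. r = n^(-1/k)):

    n = 1_000_000
    for k in (1, 2, 10, 100):
        print(k, round(n ** (-1 / k), 6))
    # 1 -> 1e-06,  2 -> 0.001,  10 -> ~0.25,  100 -> ~0.87: nearly the whole radius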

    Paul Viola 1999 Machine Learning

    Reducing Dimensionality Linearly

    l Where W has fewer rows than columns

    l What is the right W?? One that preserves as much information as possible.

    Ah, but what is information??

y = W x

    min_W  E[ ( x - W^T W x )^2 ]

    W = [ e_1^T
          e_2^T
            ⋮   ]
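A minimal PCA sketch of this idea (the toy data is an assumption): W's rows are the leading eigenvectors of the sample covariance, and W^T W x is the low-dimensional reconstruction whose error is being minimized:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # toy data: 500 points in 10-D

    Xc = X - X.mean(axis=0)                        # center the data
    C = np.cov(Xc, rowvar=False)                   # 10x10 sample covariance
    vals, vecs = np.linalg.eigh(C)                 # eigh returns ascending eigenvalues
    W = vecs[:, ::-1][:, :2].T                     # rows e_1, e_2: the top-2 eigenvectors

    y = Xc @ W.T                                   # low-dimensional code  y = W x
    x_hat = y @ W                                  # reconstruction  W^T W x
    print(np.mean(np.sum((Xc - x_hat) ** 2, axis=1)))   # average squared reconstruction error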


    Paul Viola 1999 Machine Learning

    First Eigenvector preserves more info...

    Paul Viola 1999 Machine Learning

256,000 Numbers        4,000 Numbers


    Paul Viola 1999 Machine Learning

    Dimensionality Reduction

    Paul Viola 1999 Machine Learning

    Cottrell and Metcalfe


    Paul Viola 1999 Machine Learning

    Information Theory for Signal

    Separation

    l The Cocktail Party Problem

l Many Speakers -- the signals are hopelessly mixed

    [Diagram: sound sources picked up by several microphones]

    Paul Viola 1999 Machine Learning

    Unmixed


    Paul Viola 1999 Machine Learning

Let's look at data

    Figures from Christian

    Unmixed Mixed

PCA Unmixed

    Paul Viola 1999 Machine Learning

    Mathematical Assumptions

    l Assumptions:

Sound travels instantaneously. Sound mixes linearly.

    Signals are independent.

        S = [ s_1(t)          M = [ m_1(t)          A = [ a_11  a_12  a_13
              s_2(t)                m_2(t)                a_21  a_22  a_23
              s_3(t) ]              m_3(t) ]              a_31  a_32  a_33 ]

        M = A S


    Paul Viola 1999 Machine Learning

    The Unmixing Problem

    l We would like to undo the mixing...

Ŝ = Â^{-1} A S

    Paul Viola 1999 Machine Learning

    Reducing Dimensionality Non-Linearly

    l Where W has fewer rows than columns

    l What is the right W?? One that preserves as much information as possible.

    Ah, but what is information??

y = g( W x )

    max_W  MI( g(Wx), x )

    W → independent components


    Paul Viola 1999 Machine Learning

ICA Unmixed

    Paul Viola 1999 Machine Learning

    Learning Rule

ΔW ∝ ( (W^T)^{-1} + (1 - 2y) x^T ) W^T W
       = ( I + (1 - 2y) x^T W^T ) W
       = ( I + (1 - 2y) u^T ) W            where  u = W x,  y = g(u)
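A minimal sketch of this update on a toy two-source problem (the logistic nonlinearity g, the step size eta, and the toy sources are assumptions; this is the standard infomax-style natural-gradient rule, not necessarily the exact code used in the course):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000
    S = rng.laplace(size=(2, n))                 # two independent, heavy-tailed sources
    A = np.array([[1.0, 0.6], [0.4, 1.0]])       # unknown mixing matrix
    X = A @ S                                    # microphones: M = A S

    W = np.eye(2)
    eta = 0.05
    for _ in range(2000):                        # batch natural-gradient updates
        U = W @ X
        Y = 1.0 / (1.0 + np.exp(-U))             # y = g(u), logistic
        W += eta * (np.eye(2) + (1 - 2 * Y) @ U.T / n) @ W   # dW ~ (I + (1-2y) u^T) W

    print(np.round(W @ A, 2))   # close to a scaled permutation if the sources were recovered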


    Paul Viola 1999 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    Lecture 15:

    Reasoning and Learning on Discrete Data

    Bayes Nets

    Paul Viola 1999 Machine Learning

News
    l Final Problem Set will be ready tomorrow

    Mostly Bayes Nets

    l Please begin to think about your final project


    Paul Viola 1999 Machine Learning

    Review & Overview

    l Lecture 14: Principal Components Analysis

A low dimensional projection can summarize data

    Independent Components Analysis An alternative to PCA which can pick out the independent

    sources of data.

    l Bayes Nets Meeting of the minds

    Artificial Intelligence and Machine Learning

    Represents symbolic knowledge and reasoning Principled mechanism for inference and learning

    Bayes Rule

    Paul Viola 1999 Machine Learning

Artificial Intelligence
    l Build systems that reason about the world:

    Diagnosis Why wont my car start?

    Goal directed behavior How can I get from here to the White House?

Space Probe: How do I change orbit, take photos of Mars, and communicate with Earth in the next 5 minutes?

    How can I symbolically integrate this function?

    Game Playing

How can I beat Kasparov?
    l Biases:

    Symbolic data and symbolic problems (not continuous)

    No representation of uncertainty or probability.


    Paul Viola 1999 Machine Learning

    Techniques in Artificial Intelligence

    l Write down a set of rules that govern the world If I get on a plane to Wash. DC then I will end up in DC.

    If I take a taxi to Logan then I will end up at Logan.

If my car is out of fuel then it won't start.

    If my starter motor is broken then it won't start.

    The integral of sin(x) is - cos(x).

    The integral of 2