6.891 Machine Learning
6.891 Machine Learning and Neural
Networks
6.891 Machine Learning
News
6.891 Machine Learning
Review & Overview
6.891 Machine Learning
Course Information
6.891 Machine Learning
Grading Experiment!!!
6.891 Machine Learning
Course Goals
6.891 Machine Learning
Course Goals:
NIPS 1989
6.891 Machine Learning
Course Goals:
Pitts and McCulloch, 1947
6.891 Machine Learning
Goals: Analysis and Computation
6.891 Machine Learning
What is Machine Learning?
6.891 Machine Learning
Physical Laws
[Plot: Newton's measurements, acceleration vs. force]
6.891 Machine Learning
Physical Laws: Theorize
F=ma
Ignore errors & inconsistencies.
[Plot: Newton's measurements with the fitted line F = ma]
6.891 Machine Learning
Different Types of Learned Relations
F = ma,  pV = nRT
6.891 Machine Learning
Some Notation
x = (x_1, x_2, ..., x_d)^T          (feature vector)
C in {C_1, C_2, ...}                (class label)
y = (y_1, y_2, ...)^T               (outputs)
x^j : the j-th example;   C^j, y^j : its class and output;   t^j : its target
6.891 Machine Learning
Additional Notation (Abusive!)
y^j = y(x^j),    C^j = C(x^j)

E = Σ_j l( t^j - y(x^j) )              (empirical loss)

l(z) = 0 if z = 0, 1 otherwise         (0/1 loss)
l(z) = z^2                             (squared loss)
6.891 Machine Learning
Example: Digit Recognition
6.891 Machine Learning
Tremendous Variety
6.891 Machine Learning
Hand Labeled Data
6.891 Machine Learning
Final Performance
6.891 Machine Learning
Differentiating Speech & Music
The Key Issue: Features!
6.891 Machine Learning
Speech Recognition
Sound Now is the time...
6.891 Machine Learning
Evaluation of Credit Risk
6.891 Machine Learning
Digit Recognition in Detail
Classes: "1" vs. "2"
Features: perimeter, width, height, number of black pixels
6.891 Machine Learning
Rules for Classification
Classes: "1" vs. "2"
[One of the feature graphs here]

C(x) = θ( a x + b ),   where  θ(y) = 1 if y ≥ 0, 0 otherwise
6.891 Machine Learning
Using Bayes' Law

P("2" | F = f) = P(F = f | "2") P("2") / P(F = f)

In general:   P(A | B) = P(B | A) P(A) / P(B)
6.891 Machine Learning
Combining Features
x1
x2
6.891 Machine Learning
6.891 Machine Learning and Neural
Networks
6.891 Machine Learning
News
6.891 Machine Learning
Review & Overview
6.891 Machine Learning
Fitting a Curve to Data
[Plot: 10 sample points in the unit square]

Data: {x^j, t^j}
y(x) = ??
You are given 10 example data points. These are samples of a physical relationship, perhaps including noise.
Your challenge is to make predictions for this relationship:
- Interpolation: between the example points.
- Extrapolation: beyond the data.
In principle there are an infinite number of functions that could be associated with this data; our challenge is to pick one.
In the final analysis we may want to hedge our bets and return a probability distribution over functions.
6.891 Machine Learning
Polynomial Fitting
Data: {x^j, t^j}

y(x^j; w) = w_0 + w_1 x^j + ... + w_M (x^j)^M = Σ_{l=0}^{M} w_l (x^j)^l

Weight vector:  w = (w_0, w_1, ..., w_M)
Imagine that we are constraining ourselves to the class of polynomials.
Each M-th order polynomial is parameterized by M+1 parameters.
The learning process becomes a process by which we select the values of the w_l.
The dependency of the y(.) function on w can be highlighted by the notation y(x; w).
6.891 Machine Learning
Graphical Representation
y(x^j) = Σ_l w_l (x^j)^l

[Diagram: inputs x^0, x^1, x^2, x^3, ..., x^9 feed the output y through weights w_0, w_1, w_2, w_3, ..., w_9]

Scalar representation  ->  high-dimensional, non-linear representation
- While the algebraic notation for y() is clear and specific, we will see that it is sometimes also useful to develop a graphical notation for both classifiers and regression functions.
- This idea was originally popularized in the neural network literature, wherein neural networks were almost always drawn out in their graphical form.
- The graphical notation points out that an intermediate representation for x is formed (M+1 exponentiations). The resulting problem is then one of learning the linear relationship between this high dimensional space and t.
6.891 Machine Learning
E(w) = Σ_j loss( y(x^j; w) - t^j )          (empirical loss)

w* = argmin_w E(w)                           (find the optimal weights)
Choose the Best Polynomial
- What defines the best polynomial function? Perhaps it is the one which is most consistent with the training data?
- Actually we would rather return the function which makes the best predictions on future data -- unfortunately there may be no source for this data.
- For the time being let's assume that we want to find the function which best agrees with the training data: the function with the lowest loss.
6.891 Machine Learning
Simple Loss Functions Simplify Learning
E(w) = (1/2) Σ_j ( y(x^j; w) - t^j )^2          (squared loss)

∂E/∂w_i = Σ_j ( y(x^j; w) - t^j ) (x^j)^i = 0
- Certain simple loss functions lead to learning algorithms which are easy to derive and inexpensive to compute.
- For example, squared loss can be solved by differentiating and setting the result to zero.
- The result is a set of linear equations that can be solved by inverting a matrix.
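As a minimal sketch of that step (the data recipe mirrors the Matlab example later in these notes; the variable names are illustrative, not from the slides), the normal equations can be solved directly:

x = (1:10)'/10;                                   % training inputs
t = 0.5 + 0.4*sin(2*pi*x) + 0.1*randn(size(x));   % noisy targets
M = 3;                                            % polynomial order
X = zeros(numel(x), M+1);
for l = 0:M
    X(:, l+1) = x.^l;                             % design matrix: column l+1 holds x^l
end
w = (X'*X) \ (X'*t);                              % normal equations: w = (X'X)^{-1} X't
y = X*w;                                          % fitted values at the training points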
6.891 Machine Learning
First order fit
[Plot: first-order (linear) fit to the 10 data points]

E(w) = (1/2) Σ_j ( y(x^j; w) - t^j )^2

The optimal function minimizes the residual error.
- There is a pleasant physical analogy for the squared loss. The function is connected by springs to the data. The system is then allowed to relax until the forces are balanced. The minimum energy solution is the one that is closest to the training data.
6.891 Machine Learning
Fitting Different Polynomials
[Plots: polynomial fits of order 1, 2, 3, 4, 6, and 9 to the 10 data points]
Each order of polynomial leads to a different fit.
Higher order polynomials come closer to the training data.
The 9th order polynomial can fit the 10 datapoints perfectly.
Which of these is the most likely to generalize?
6.891 Machine Learning
Target Function
h(x) = 0.5 + 0.4 sin(2πx)
t^j = h(x^j)

[Plot: the target function h(x) on [0, 1]]
The function that generated the data was not a polynomial at all.
6.891 Machine Learning
Matlab Code
% Construct training data
train_in = [1:10]/10;
train_out = 0.5 + 0.4 * sin(2 * pi * train_in) + 0.1 * randn(size(train_in));
% Fit a polynomial
order = 3
p = polyfit(train_in, train_out, order)
% Construct a test set
test_in = [1:300]/300;
true_out = 0.5 + 0.4 * sin(2 * pi * test_in);
% Compute the polynomial prediction
fit_out = polyval(p, test_in);
% Plot the results: first the training data ('o'), second the true test curve, third the predictions
plot(train_in, train_out, 'o', test_in, true_out, test_in, fit_out)
axis([0 1 -0.1 1.1])
Above is all the code used to generate the previous graphs.
As we can see, Matlab allows us to explore issues in machine learning without much hacking.
6.891 Machine Learning
First General Problem in Learning
There are a few general (grand challenge) problems in machine learning. Perhaps the most important is the problem of controlling complexity. We have seen that a simpler approximator can predict an unknown function better than a more complex approximator. Building a theory for this is one grand challenge of learning.
Clearly this problem has been appreciated for a long time.
Vapnik's statement is perhaps the most confusing. When he says "theory" what he means is something like a learning algorithm. The learning algorithm which fits 9th order polynomials to 10 datapoints is not falsifiable:
- No set of 10 datapoints would fail to be fit perfectly by this algorithm.
6.891 Machine Learning
Overfitting in Classification
This is not to say that such problems are unique to regression. Determining decision boundaries for classification is very similar.
We need to balance the complexity of the boundary against the accuracy on training data.
6.891 Machine Learning
Recall the probabilistic approach
P(F|C1) P(F|C2)
6.891 Machine Learning
Probabilistic Approach
Thomas Bayes, 1702-1761

P(C = C_k | X = x) = P(X = x | C = C_k) P(C = C_k) / P(X = x)

Shorthand:   P(C_k | x) = P(x | C_k) P(C_k) / P(x)

Joint:  P(C = C_k, X = x)
Class-conditionals:  P(X | C_1)  &  P(X | C_2)
6.891 Machine Learning
Probability Densities
P( X in [a, b] ) = ∫_a^b p(X = x) dx

p(x) ≡ p_X(x) ≡ p(X = x)

P(C = C_k | X = x) = p(X = x | C = C_k) P(C = C_k) / p(X = x)

P(C_k | x) = p(x | C_k) P(C_k) / p(x)

p(X = x) dx = P( X in [x, x + dx] )
Probability densities are necessary because for a continuous random variable the probability of any single value is zero.
The density measures the slope of the cumulative distribution function.
Alternatively, it is the probability per unit length (or area, or volume) measured over an infinitesimal region.
Somewhat surprisingly, the density is used in the same way that the distribution function is used. In other words the probability distribution of the class given the feature value can be found using the densities of the features.
6.891 Machine Learning
Bayes Law for Densities
C(x) = C_1  if  P(C_1 | x) > P(C_2 | x),  otherwise C_2
Duda & Hart, 1973
Class 1 Class 2
Just as before, we can graph the conditional probability of each class given the feature; the functions are now continuous.
Given the Bayes classification rule, a set of decision regions is defined.
6.891 Machine Learning
Decisions
These analyses are easily generalized to:
- Multiple classes
- Multiple dimensions
6.891 Machine Learning
Probabilistic Classification Review
P(F|C) & P(C) -> P(F,C)
6.891 Machine Learning
Information Retrieval
6.891 Machine Learning
Keyword Search Works Well
6.891 Machine Learning
Naive Bayes Classifier
P(F_i | C_j) : probability of word i appearing in a doc from class j
f_i(Doc_j) = 1 if Doc j has word i.

P({f_i} | C_j) = Π_i P(F_i = f_i | C_j)

P(C_j | {f_i}) = P(C_j) Π_i P(F_i = f_i | C_j) / Π_i P(F_i = f_i)
6.891 Machine Learning
P(F_i) = #{Docs containing word i} / #{Docs}

P(F_i | C_j) = #{Training docs from class j containing word i} / #{Training docs from class j}

Potential bug: none of our training docs contain "Mercedes".
Estimating Probabilities
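A minimal sketch of these counts with add-one (Laplace) smoothing, which guards against the zero-count "Mercedes" problem (the function name, variable names, and the smoothing constant are illustrative assumptions, not from the slides):

function p = estimate_word_probs(docs, labels)
% docs: N x V binary matrix, docs(k,i) = 1 if training doc k contains word i
% labels: N x 1 vector of class indices (1..C)
% Returns p(i,j) = estimated P(word i present | class j), with add-one smoothing.
V = size(docs, 2);
C = max(labels);
p = zeros(V, C);
for j = 1:C
    in_class = docs(labels == j, :);          % docs belonging to class j
    n_j = size(in_class, 1);                  % number of class-j docs
    counts = sum(in_class, 1);                % class-j docs containing each word
    p(:, j) = (counts' + 1) / (n_j + 2);      % Laplace-smoothed estimate, never exactly 0 or 1
end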
6.891 Machine Learning
Density Estimation is Ambiguous
6.891 Machine Learning
Impacts Classification
Machine Learning
Review & Overview
Machine Learning
Keyword Search Works Well
Machine Learning
Bayesian Text Classification
Probability of word i appearing in a doc from class j:
P(F_i = 1 | C = c_j) = p_ij

Assume independence:
P(F_1 = f_1, F_2 = f_2, ..., F_N = f_N | C = c_j) = Π_i P(F_i = f_i | C = c_j)

Notation:
W = {d^k} : a collection of documents
d^k_i = 1 if document k contains word i, 0 otherwise
Machine Learning
Bayes Nets Show Dependencies
P({f_i} | C_j) = Π_i P(F_i = f_i | C_j)

[Diagram: node C with arrows to F_1, F_2, F_3, F_4, ..., F_N; each edge carries P(F_i | C)]
Bayes Nets show the dependencies between RVs
Machine Learning
Classification Using Bayes Law
P(c_j | {f_i}) = P(c_j) Π_i P(f_i | c_j) / P({f_i})

c = 1 : German cars
c = 0 : other documents
Machine Learning
Estimating Probability Distributions
P(F_i = 1 | C = c_j) = p_ij

W = {d^k} : a collection of documents
d^k_i = 1 if document k contains word i, 0 otherwise

How do we estimate p_ij?
Machine Learning
Maximum Likelihood
P(d^k | c_0) = P({F_1 = f_1^k, ..., F_N = f_N^k} | c_0) = Π_i P(F_i = f_i^k | c_0)
             = Π_i p_ij^{f_i^k} (1 - p_ij)^{1 - f_i^k}

Over the whole training set for the class:
Π_k P(d^k | c_0) = Π_i p_ij^{n_i} (1 - p_ij)^{N - n_i}

(n_i = number of training docs containing word i;  N = number of training docs)
Machine Learning
Log Likelihood
L = log[ p_ij^{n_i} (1 - p_ij)^{N - n_i} ]
  = n_i log(p_ij) + (N - n_i) log(1 - p_ij)

∂L/∂p_ij = n_i / p_ij - (N - n_i) / (1 - p_ij) = 0
Machine Learning
Maximum Likelihood
n_i / p_ij = (N - n_i) / (1 - p_ij)
n_i (1 - p_ij) = (N - n_i) p_ij
n_i = N p_ij

p_ij = n_i / N
Machine Learning
P(F_i = 1) = #{Docs containing word i} / #{Docs}

P(F_i = 1 | C_j) = #{Training docs from class j containing word i} / #{Training docs from class j}

Potential bug: none of our training docs contain "Mercedes".
Estimating Probabilities
Machine Learning
Prior Expectations
p(mercedes | GermanCars)?
Machine Learning
Bayesian Parameter Estimation
Maximum a posteriori (MAP):
P(p_ij | {d^k}, c_0) = P({d^k} | p_ij, c_0) p(p_ij) / P({d^k} | c_0)

Maximum likelihood (ML):
p_ij = argmax_{p_ij} P({d^k} | p_ij, c_0)

The posterior turns out to be more useful for continuous parameters.
Machine Learning
What is the right prior?
max_{p_ij} P(p_ij | {d^k}, c_0) = max_{p_ij} P({d^k} | p_ij, c_0) p(p_ij) / P({d^k} | c_0)

With a flat prior p(p_ij) this reduces to maximum likelihood:
max_{p_ij} P({d^k} | p_ij, c_0) = max_{p_ij} p_ij^{n_i} (1 - p_ij)^{N - n_i}
Machine Learning
Probability of the parameters
P({d^k} | p_ij, c_0) = p_ij^{n_i} (1 - p_ij)^{N - n_i}
Machine Learning
Bayesian Estimation
P(F_i = 1 | C = c_j, p_ij) = p_ij

P(p_ij | {d^k}, c_j) = P({d^k} | p_ij, c_j) p(p_ij) / P({d^k} | c_j)

P(F_i | c_j) = ∫ P(F_i | p_ij, c_j) p(p_ij | {d^k}, c_j) dp_ij

P(F_i = 1 | c_j) = ∫ p_ij P({d^k} | p_ij, c_j) p(p_ij) dp_ij / P({d^k} | c_j)
Machine Learning
Continued
P(F_i = 1 | c_j) = ∫ p_ij P({d^k} | p_ij, c_j) p(p_ij) dp_ij / ∫ P({d^k} | p_ij, c_j) p(p_ij) dp_ij

With a flat prior:
             = ∫ p_ij^{n_i + 1} (1 - p_ij)^{N - n_i} dp_ij / ∫ p_ij^{n_i} (1 - p_ij)^{N - n_i} dp_ij
Machine Learning
6.891 Machine Learning and Neural
Networks
Machine Learning
News
Machine Learning
Review & Overview
Machine Learning
Why Gaussians ?
Machine Learning
Recall: Bayes Decision Boundaries
Machine Learning
Discriminant Function
Machine Learning
Set Discriminants Equal
Machine Learning
Machine Learning
Bayesian Parameter Estimation
Machine Learning
Convergence of Probability
Machine Learning
Reminder: Why we are here
[Plot: two 1-D samples of 20 points each shown along the real line;
 one sample spans roughly 0.18 to 4.08, the other roughly -1.27 to 1.40]
Machine Learning
Max Likelihood Gaussian
Mean: 0.16
StDev: 0.8
Mean: 2.2
StDev: 1.0
Machine Learning
Different Samples, Different Decisions
Concept: Variance
The Variation you observe
when training on different
independent training sets
Machine Learning
Variance depends on data set size...
20 points 2000 points
Machine Learning
But when data gets more complex...
Machine Learning
Gaussians don't work well
Concept: Training Error
Error in your classifier
on the training set
Paul Viola 1999 Machine Learning 3
Review & Overview
Paul Viola 1999 Machine Learning 4
Histogram

[Plot: the 20 sample points and their histogram]

Divide by N to yield a probability.
Paul Viola 1999 Machine Learning 5
Simple Algorithm
function counts = myhist(data, centers)
% Count how many data points fall closest to each bin center.
counts = zeros(size(centers));
numdata = size(data,1);
% For each datapoint compute the distance to every center
for i = 1:numdata
    diffs = data(i,1) - centers;
    dists = diffs.^2;
    [minval, mindist] = min(dists);
    counts(mindist) = counts(mindist) + 1;
end
Paul Viola 1999 Machine Learning 6
Histogram
Paul Viola 1999 Machine Learning 7
Max Likelihood Gaussian
Mean: 0.16
StDev: 0.8
Mean: 2.2
StDev: 1.0
Paul Viola 1999 Machine Learning 8
Histogram Flexibility is Adjustable
[Plots: histograms of the same data for six different bin widths]
Paul Viola 1999 Machine Learning 9
Histograms have lower bias
Paul Viola 1999 Machine Learning 10
but higher variance.
Paul Viola 1999 Machine Learning 11
Parzen: One Bump per Data Point
[Plot: Parzen density estimate formed by placing one Gaussian bump on each data point]
Paul Viola 1999 Machine Learning 12
Parzen Algorithm
function [func, range] = parzen(data, sigma)
% Parzen window estimate: one Gaussian bump (width sigma) per data point.
range = linspace(min(data), max(data), 500);     % evaluation grid (original used a splitrange helper)
numdata = size(data, 1);
func = zeros(size(range));
% For each point on the range compute the distance to every datapoint
for i = 1:size(range, 2)
    d = range(i) - data;
    gaussvals = exp(-d.^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma);
    func(i) = sum(gaussvals)/numdata;
end
plot(range, func)
Paul Viola 1999 Machine Learning 13
Parzen and Histogram are Similar
Paul Viola 1999 Machine Learning 14
All Three at Once
[Plot: histogram, Parzen estimate, and maximum-likelihood Gaussian overlaid on the same data]
Paul Viola 1999 Machine Learning 15
Properties of Non-parametric Techniques
Paul Viola 1999 Machine Learning 16
Semi-Parametric Models
Paul Viola 1999 Machine Learning 17
Gaussian(μ_1, σ_1)        Gaussian(μ_2, σ_2)

Flip a coin k; if k = 1 draw x^j from p(x | μ_1, σ_1), if k = 2 draw x^j from p(x | μ_2, σ_2).
P(k = 1) + P(k = 2) = 1

p(x) = ???
Paul Viola 1999 Machine Learning 18
Events are Disjoint -> They add
p(X = x) = p(X = x, J = 1) + p(X = x, J = 2)
         = p(x | J = 1) P(J = 1) + p(x | J = 2) P(J = 2)
         = p(x | μ_1, σ_1) P(J = 1) + p(x | μ_2, σ_2) P(J = 2)
Paul Viola 1999 Machine Learning 19
Face Detection
Paul Viola 1999 Machine Learning 20
Generating Training Data
Sung &
Poggio
Paul Viola 1999 Machine Learning 21
Results
But, it takes minutes per image
Paul Viola 1999 Machine Learning 22
Face Detection
Paul Viola 1999 Machine Learning 23
Events are Disjoint -> They add
p(X = x) = p(X = x, J = 1) + p(X = x, J = 2)
         = p(x | J = 1) P(J = 1) + p(x | J = 2) P(J = 2)
         = p(x | μ_1, σ_1) P(J = 1) + p(x | μ_2, σ_2) P(J = 2)
Paul Viola 1999 Machine Learning 24
Expectation Maximization
EM updates for a mixture of Gaussians:

μ_k = Σ_j P(k | x^j) x^j / Σ_j P(k | x^j)

σ_k^2 = Σ_j P(k | x^j) ( x^j - μ_k )^2 / Σ_j P(k | x^j)

P(k) = (1/N) Σ_j P(k | x^j)

E[ log l({μ_k, σ_k, P(k)}) ]        Bounded below?   Decreases?
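A minimal EM sketch for a 1-D two-component Gaussian mixture following the updates above (the data, initialisation, and iteration count are illustrative assumptions):

x = [0.16 + 0.8*randn(20,1); 2.2 + 1.0*randn(20,1)];   % toy 1-D data from two clumps
mu = [0; 1];  sig = [1; 1];  pk = [0.5; 0.5];           % initial guesses
for iter = 1:50
    % E-step: responsibilities P(k | x_j)
    lik = zeros(numel(x), 2);
    for k = 1:2
        lik(:,k) = pk(k) * exp(-(x - mu(k)).^2 / (2*sig(k)^2)) / (sqrt(2*pi)*sig(k));
    end
    r = lik ./ repmat(sum(lik, 2), 1, 2);
    % M-step: re-estimate means, standard deviations, and mixing proportions
    for k = 1:2
        nk = sum(r(:,k));
        mu(k)  = sum(r(:,k) .* x) / nk;
        sig(k) = sqrt(sum(r(:,k) .* (x - mu(k)).^2) / nk);
        pk(k)  = nk / numel(x);
    end
end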
Paul Viola 1999 Machine Learning
News
Paul Viola 1999 Machine Learning
Distribution of Grades: Pset 1
Paul Viola 1999 Machine Learning
Review & Overview
Paul Viola 1999 Machine Learning
Where are we?
Paul Viola 1999 Machine Learning
Between density and classification
Paul Viola 1999 Machine Learning
Gaussians vs. Discriminants
y(x) = w^T x + w_0

Two-class Gaussian with the same covariance  ->  a linear discriminant
Paul Viola 1999 Machine Learning
Linear Discriminant
y(x) = w^T x + w_0 = Σ_{i=0}^{d} w_i x_i        (x_0 = 1 supplies the bias -- warning!)

[Diagram: inputs x_0, x_1, x_2, ..., x_d feed y through weights w_0, w_1, w_2, ..., w_d]
Paul Viola 1999 Machine Learning
Multiple Discriminants

y_1(x) = w_1^T x + w_10
y_2(x) = w_2^T x + w_20

Decision boundary:  y_1(x) = y_2(x)
w_1^T x + w_10 = w_2^T x + w_20
(w_1 - w_2)^T x + (w_10 - w_20) = 0
w^T x + w_0 = 0
Paul Viola 1999 Machine Learning
in a single network
y_k(x) = Σ_i w_ki x_i + w_k0            (w_ki : a weight matrix)

C(x) = C_i   if   i = argmax_k y_k(x)
Paul Viola 1999 Machine Learning
Multiple Discriminants
Intersection
of Half Planes
Paul Viola 1999 Machine Learning
Quadratic cost is very simple

E(w) = Σ_j ( y(x^j) - t^j )^2 = Σ_j ( w^T x^j - t^j )^2

In matrix form:
E(W) = (XW - T)^T (XW - T) = W^T X^T X W - 2 W^T X^T T + T^T T

dE(W)/dW = 2 X^T X W - 2 X^T T = 0
X^T X W = X^T T
W = (X^T X)^{-1} X^T T
Paul Viola 1999 Machine Learning
Is this a model for the brain?
Pitts and McCulloch, 1947
Paul Viola 1999 Machine Learning
Can't Always Solve for the Weights

y(x) = g( w^T x )

g(a) = 1 if a ≥ 0, 0 otherwise          (or: 1 if a ≥ 0, -1 otherwise)
Paul Viola 1999 Machine Learning
Perceptron
Paul Viola 1999 Machine Learning
Perceptron Cost Function
E(w) = Σ_j ( g(w^T x^j) - t^j )^2

∂E/∂w = 2 Σ_j ( g(w^T x^j) - t^j ) ∂g(w^T x^j)/∂w   -- but g is a step function, so
simple gradient descent does not work.

Perceptron criterion (sum over misclassified examples only):
E(w) = - Σ_{errors} g(w^T x^j) t^j  ∝  - Σ_{errors} w^T x^j t^j

∂E/∂w = - Σ_{errors} t^j x^j

Update rule:   w_t = w_{t-1} + η t^j x^j
Paul Viola 1999 Machine Learning
Different Error Measures
Paul Viola 1999 Machine Learning
Perceptron Learning
[Diagram: inputs x_0, x_1, x_2, ..., x_d feed y through weights w_0, w_1, w_2, ..., w_d]
Paul Viola 1999 Machine Learning
Real Perceptrons
Paul Viola 1999 Machine Learning
A classic problem...
[Scatter plot: a cloud of x's and a cloud of o's arranged so that no single straight line separates them]
Paul Viola 1999 Machine Learning
6.891 Machine Learning and Neural
Networks
Lecture 7:
Multi-Layer Perceptrons
Back Propagation
Paul Viola 1999 Machine Learning
News
- Pset 3 is on the web
  - Includes a classifier shootout
  - The mystery dataset has 20 dimensions and two classes
  - Winner gets $10 of Toscanini's
- Pset 2 looks great. Many of you did a lot of work.
Paul Viola 1999 Machine Learning
Review & Overview
- Lecture 6: Linear discriminants
  - Perceptrons
  - Training perceptrons
- Generalized perceptrons
- Multi-layer perceptrons
  - Multi-layer derivatives
  - Back propagation
- Examples: NETtalk
Paul Viola 1999 Machine Learning
On-line learning of Perceptrons
1: Error function (criterion):   E = Σ_j E_j(w)

2: Update rule:   w_t = w_{t-1} - η ∂E_j/∂w = w_{t-1} + η t^j x^j

[Diagram: inputs x_0, ..., x_d, weights w_0, ..., w_d, output y]

- Pick a random example
- Observe the output / error
- Adjust the weights to reduce the error
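A minimal on-line perceptron training loop following the update rule above (the toy data, learning rate, and iteration count are illustrative assumptions):

N = 100;  d = 2;
X = [ones(N,1), randn(N,d)];           % first column of ones is the bias input x_0
t = sign(X * [0.5; 1; -1]);            % linearly separable toy targets in {-1, +1}
w = zeros(d+1, 1);  eta = 0.1;
for step = 1:100
    j = ceil(rand * N);                % pick a random example
    y = sign(w' * X(j,:)');            % observe the output
    if y ~= t(j)                       % adjust weights only on errors
        w = w + eta * t(j) * X(j,:)';
    end
end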
Paul Viola 1999 Machine Learning
Different Criteria
Linear discriminant (squared error):
E(w) = Σ_j ( w^T x^j - t^j )^2
∂E/∂w = 2 Σ_j ( w^T x^j - t^j ) x^j
Update:   w_t = w_{t-1} - η Σ_j ( w^T x^j - t^j ) x^j

Perceptron criterion (misclassified examples only):
E(w) = - Σ_{errors} w^T x^j t^j
∂E/∂w = - Σ_{errors} t^j x^j
Update:   w_t = w_{t-1} + η t^j x^j
Paul Viola 1999 Machine Learning
Normalizing examples
w_t = w_{t-1} + η x^j        (for errors only, after each example has been normalized by its target)
Paul Viola 1999 Machine Learning
The update rule in action...
w_t = w_{t-1} + η x^j
Paul Viola 1999 Machine Learning
Real Perceptrons
Paul Viola 1999 Machine Learning
A classic problem...
[Scatter plot: a cloud of x's and a cloud of o's arranged so that no single straight line separates them]
Paul Viola 1999 Machine Learning
Generalized Perceptron
XOR(x_1, x_2) = x_1 + x_2 - 2 x_1 x_2,    x_i in {0, 1}

y(x) = g(w^T x)  -- can't do that with x = (x_1, x_2) alone!

Augment the input:   φ(x) = (x_1, x_2, x_1 x_2)^T
Then   w = (1, 1, -2.1)^T   works great.
Paul Viola 1999 Machine Learning
Multiple Layers
[Diagram: a two-layer network; inputs x_1, x_2 and a constant unit (u_1, u_2, u_3) feed hidden unit u_4 through weights w_41, w_42, w_43, and then the output y through weights w_52, w_53, w_54]

W : the matrix collecting all of these connection weights (mostly zeros).

How can we learn this??
Paul Viola 1999 Machine Learning
1986: The Rebirth of Neural Networks
- The PDP Group had a huge impact
Paul Viola 1999 Machine Learning
1980s: Perhaps Gradient Descent?
y(x) = s( w^T x ),    s(a) = 1 / (1 + e^{-a})       (sigmoid)

E(w) = Σ_j ( s(w^T x^j) - t^j )^2

∂s(u)/∂w = s(u) ( 1 - s(u) ) ∂u/∂w

∂E/∂w = 2 Σ_j ( s(w^T x^j) - t^j ) s(w^T x^j) ( 1 - s(w^T x^j) ) x^j
Paul Viola 1999 Machine Learning
Sigmoid Multi-Layer Network
y(x) = s( w_64 u_4 + w_65 u_5 )
u_4 = s( w_41 u_1 + w_42 u_2 + w_43 u_3 )
u_5 = s( w_51 u_1 + w_52 u_2 + w_53 u_3 )

so   y(x) = s( w_64 s(w_41 u_1 + w_42 u_2 + w_43 u_3) + w_65 s(w_51 u_1 + w_52 u_2 + w_53 u_3) )

E(w) = Σ_j ( y(x^j) - t^j )^2   -- differentiate through the nested sigmoids.

[Diagram: inputs x_1, x_2 and a bias unit (u_1, u_2, u_3) feed hidden units u_4, u_5, which feed the output unit u_6 = y]
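A minimal forward pass for a small network of this shape (the weight values and the input are illustrative assumptions, not from the slides):

s  = @(a) 1 ./ (1 + exp(-a));          % sigmoid
x  = [0.3; 0.7];
u  = [x; 1];                           % u1, u2, u3 (u3 is the constant bias unit)
W1 = [0.5 -0.2 0.1;                    % rows: weights into hidden units u4 and u5
      0.3  0.8 -0.4];
W2 = [1.2 -0.7];                       % weights w64, w65 into the output
h  = s(W1 * u);                        % hidden activations u4, u5
y  = s(W2 * h);                        % network output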
Paul Viola 1999 Machine Learning
Multi-Layer Conventions
a_k = Σ_j w_kj u_j         (net input to unit k)
u_k = g( a_k )

- Networks must not have loops
- Units are ordered:  i > k  ->  u_i is not an input to u_k
- Compute the units in order

[Diagram: the same two-layer network as above]
Paul Viola 1999 Machine Learning
More Conventions
- If units are organized in layers, each layer can be computed in parallel:
  u_2 = g( W_12 u_1 ),    u_3 = g( W_23 u_2 )
Paul Viola 1999 Machine Learning
Solving XOR (big deal?)
Paul Viola 1999 Machine Learning
Vision Applications (sort of)
Ts vs. Cs
Paul Viola 1999 Machine Learning
Very Simple Solution
Paul Viola 1999 Machine Learning
NETtalk (1986): First Real Application
- Task: pronounce English text (text -> phonemes)
- Example: "This is it."  /s/ vs. /z/
- 29 possible characters
- 26 phonemes
- 7-character window
- Structure: 203 inputs, 80 hidden units, 26 outputs
- 95% accurate
Paul Viola 1999 Machine Learning
6.891 Machine Learning and Neural
Networks
Lecture 8:
Back Prop and Beyond
Paul Viola 1999 Machine Learning
News
- The mid-term will be on 10/20, here in this room.
  - It should take about 1 hour but we will give you 1.5. Show up on time, please.
- Coverage: Psets 1, 2 and 3
  - Density estimation (parametric, semi- and non-parametric)
  - Bayesian classification
  - Discriminants (linear, perceptron, multi-layer)
Paul Viola 1999 Machine Learning
Review & Overview
- Lecture 7: Multi-layer derivatives
  - Back propagation
  - Examples: NETtalk
- Why 6.891 is not over!
  - Bugs with gradient descent: local minima
  - Bias and variance: how many units?
  - Variants
Paul Viola 1999 Machine Learning
Face Detection Network:
General Layout
Baluja, Rowley, and Kanade
Paul Viola 1999 Machine Learning
Intensity Preprocessing
Paul Viola 1999 Machine Learning
Training Data
Positives
Negatives
Paul Viola 1999 Machine Learning
Performance
Paul Viola 1999 Machine Learning
MLP: How Powerful?
Paul Viola 1999 Machine Learning
Derivatives are Cheap
- Modeling factories/plants
  [Diagram: a Plant block with inputs (control, materials) and outputs (products), modeled by an MLP]
- Train the MLP with back prop
- Use its derivatives to modify the input so as to improve the output
Paul Viola 1999 Machine Learning
1990: The Height of MLPs and Back Prop
- Multi-layer perceptrons can solve any approximation problem (in principle)
  - Given 3 layers
  - Given an infinite number of units and weights
- There is no direct technique for finding the weights (unlike linear discriminants)
- Gradient descent (using back prop) comes to dominate discussion in the neural net community
  - Can you find a good set of weights quickly? How can you speed things up? Will you get stuck in local minima?
- A small group in the community also worries about generalization.
Paul Viola 1999 Machine Learning
How long until we find the minimum?
- Simplest case: 1 weight, quadratic error function, y = w x

E(w) = ( w x - y )^2 = ( w - 0 )^2 = w^2       (with x = 1, y = 0)

Gradient descent:   w_t = w_{t-1} - η ∂E/∂w = ( 1 - 2η ) w_{t-1}

With η = 1/2 the minimum is reached in a single step.
Paul Viola 1999 Machine Learning
Scale the Input
- Simplest case: 1 weight, quadratic error function, y = w x

E(w) = ( 2w - 0 )^2 = 4 w^2        (the input is now 2)

w_t = w_{t-1} - η ∂E/∂w,   and the appropriate step size shrinks to η = 1/8.

Hack 1: start η very small; increase it if the error decreases.
More hacks coming!
Paul Viola 1999 Machine Learning
Multiple Weights
[Plot: error surface over two weights; step sizes 0.020, 0.047, 0.049, 0.050 give very different convergence]

Hack 2: Momentum
Paul Viola 1999 Machine Learning
Momentum
Δw_t = - η ∂E/∂w + α Δw_{t-1}
w_t = w_{t-1} + Δw_t
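A minimal gradient-descent-with-momentum loop on a toy quadratic error (the error function, η, and α are illustrative assumptions):

grad = @(w) 8*w;          % dE/dw for the toy error E(w) = 4 w^2
w = 1.0;  dw = 0;
eta = 0.05;  alpha = 0.9; % learning rate and momentum coefficient
for t = 1:100
    dw = -eta * grad(w) + alpha * dw;   % momentum update
    w  = w + dw;                        % take the step
end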
Paul Viola 1999 Machine Learning
Second Order Techniques
- Gradient descent assumes a locally linear cost function.
- Second order techniques assume a locally quadratic one:
Gradient descent:   w_t = w_{t-1} - η ∂E/∂w

Locally quadratic model:
E(w + δ) ≈ E(w) + E'(w) δ + (1/2) E''(w) δ^2

For a true quadratic   E(w) = a w^2 + b w + c:
E'(w) = 2 a w + b,    E''(w) = 2a

Newton step:
w_1 = w_0 - E'(w_0) / E''(w_0) = w_0 - (2 a w_0 + b) / (2a) = -b / (2a)

i.e. a single step lands on the minimum.
Paul Viola 1999 Machine Learning
No Hands Across America
Paul Viola 1999 Machine Learning
Zip Codes
Le Cun
Paul Viola 1999 Machine Learning
6.891 Machine Learning and Neural
Networks
Lecture 9:
On to Support Vector Techniques
Paul Viola 1999 Machine Learning
News
- The final will be 12/13 at 1:30 PM. If you have a conflicting final, let us know.
- Remember that almost all the material appears in the book. Right now we are jumping back and forth between Chapter 5 and Chapter 6.
Paul Viola 1999 Machine Learning
Why did we need multi-layer
perceptrons?
- Problems like this seem to require very complex non-linearities.
- Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.
Paul Viola 1999 Machine Learning
Why an exponential number of features?
Expanding x = (x_1, ..., x_5) into all monomials up to a given order:
φ(x) = ( x_1^5, x_1^4 x_2, x_1^3 x_2^2, x_1^2 x_2^3, ..., x_1^2 x_2 x_3, ... )

The number of monomials of n variables up to order k is
C(n + k, k) = (n + k)! / (n! k!) = O( n^min(k, n) )

n : number of variables,   k : polynomial order

14th order??? 120 features
n = 21, k = 5  ->  about 65,000 features
Paul Viola 1999 Machine Learning
MLPs vs. Perceptrons
- MLPs are incredibly hard to train
  - Takes a long time (unpredictably long)
  - Can converge to poor minima
- MLPs are hard to understand: what are they really doing?
- Perceptrons are easy to train
  - A type of linear programming; polynomial time.
  - One minimum, which is global.
- Generalized perceptrons are easier to understand: polynomial functions.
Paul Viola 1999 Machine Learning
Perceptron Training
is Linear Programming
Find w such that   w^T x^l > 0   for every example l
(after normalization, after adding a bias, assuming no errors)

This is polynomial time in the number of variables and in the number of constraints.

What about linearly inseparable data? Add slack variables:
w^T x^l + s_l > 0,    s_l > 0,    minimize Σ_l s_l
Paul Viola 1999 Machine Learning
Rebirth of Perceptrons
- How to train efficiently: linear programming (later, quadratic programming)
- How to get so many features inexpensively?!?
- How to generalize with so many features? Occam's revenge.
-> Support Vector Machines
Paul Viola 1999 Machine Learning
Lemma 1: Weight vectors are simple
- The weight vector lives in a sub-space spanned by the examples; its dimensionality is determined by the number of examples, not the complexity of the space.

w_0 = 0,   w_t = w_{t-1} + η t^l x^l   =>   w = Σ_l b_l x^l    (or  w = Σ_l b_l φ(x^l))
Paul Viola 1999 Machine Learning
Lemma 2: Only need to compare examples
Paul Viola 1999 Machine Learning
Perceptron Rebirth: Generalization
- Too many features, so Occam is unhappy. Perhaps we should encourage smoothness?

Σ_j b_j K(x^j, x^l) + s_l > 0,    s_l > 0,    minimize Σ_l s_l
and also minimize Σ_j b_j^2        (smoother)

But this is unstable!!
Paul Viola 1999 Machine Learning
The linear program is not unique: it could return any multiple of the correct weight vector --
if   w^T x^i > 0   for all i, then   (α w)^T x^i > 0   as well.

Add slack variables and a weight prior to force the solution toward zero:
w^T x^i + s_i > 0,    s_i > 0,    minimize Σ_i s_i   and   Σ_i w_i^2
Paul Viola 1999 Machine Learning
Definition of the Margin
- Margin: the gap between negatives and positives, measured perpendicular to the hyperplane.
Paul Viola 1999 Machine Learning
SVM: examples
Paul Viola 1999 Machine Learning
SVM: Key Ideas
- Augment inputs with a very large feature set φ(x): polynomials, etc.
- Use the "kernel trick" to do this efficiently
- Enforce/encourage smoothness with a weight penalty: minimize w^T w = Σ_i w_i^2
- Introduce a margin so that the constraints become a set of linear inequalities
- Find the best solution using quadratic programming
Paul Viola 1999 Machine Learning
SVM: Difficulties
- How do you pick the kernels? Intuition / luck / magic / ...
- What if the data is not linearly separable? (i.e. the constraints cannot be satisfied)
  Add slack variables c_j:
  (2 t^j - 1) Σ_i w_i K(x^i, x^j) ≥ 1 - c_j,      minimize  w^T w + λ Σ_j c_j
Paul Viola 1999 Machine Learning
SVM: Simple Example
- Data dimension: 2
- Feature space: 2nd-order polynomial
  - 4 dimensional
  - 6 weights
Paul Viola 1999 Machine Learning
Zip Codes
Much Effort spent on
organizing the network
Paul Viola 1999 Machine Learning
SVM: Zip Code Recognition
- Data dimension: 256
- Feature space: 4th order, roughly 100,000,000 dimensions
Paul Viola 1999 Machine Learning
SVM: Faces
Paul Viola 1999 Machine Learning
Support Vectors
Paul Viola 1999 Machine Learning
6.891 Machine Learning and Neural
Networks
Lecture 10:
Support Vector Machines
More Details and Derivations
Paul Viola 1999 Machine Learning
News
- The quiz is 1 week from today.
- Problem set 4 will go out right after the quiz, in one week; it's a pain to do two things at once.
- The problem sets are very good (once again). On an absolute scale many of you are getting A's.
Paul Viola 1999 Machine Learning
Pset 2
Paul Viola 1999 Machine Learning
Review & Overview
- Lecture 9: resurrecting perceptrons; setting up Support Vector Machines
- SVM review
- Why is it called "support vectors"??
- Derivation of some simpler properties.
Paul Viola 1999 Machine Learning
SVM: Key Ideas
- Augment inputs with a very large feature set φ(x): polynomials, etc.
- Use the "kernel trick" to do this efficiently (avoid forming the huge feature vectors explicitly!)
- Enforce/encourage smoothness with a weight penalty: minimize b^T b = Σ_i b_i^2,  b_i ≥ 0
- Find the best solution using quadratic programming
Paul Viola 1999 Machine Learning
Support Vectors
- Many of the b's are zero -- inactive constraints. Only keep examples where b_i ≠ 0.
- Likely to generalize well (VC dimension -- later in the semester)

Constraints:   Σ_i b_i K(x^i, x^j) c^j ≥ 1,    subject to   min w^T w
Classifier:    y(x) = Σ_i b_i K(x^i, x) c^i
Paul Viola 1999 Machine Learning
The optimal dividing line
- The optimal separator maximizes the margin between positive and negative examples.

d_- = max over negatives of  w^T x^i
d_+ = min over positives of  w^T x^i

margin = ( d_+ - d_- ) / ||w||        -- maximize over w
Paul Viola 1999 Machine Learning
Definition of the Margin
- Margin: the gap between negatives and positives, measured perpendicular to the hyperplane.
Paul Viola 1999 Machine Learning
Optimal dividing line = Support Vectors

d_- = max over negatives of  w^T x^i
d_+ = min over positives of  w^T x^i

Maximizing ( d_+ - d_- ) / ||w||  is equivalent to requiring
w^T x^i ≤ -1 for negatives,   w^T x^i ≥ +1 for positives,   and minimizing  w^T w
Paul Viola 1999 Machine Learning
Lemma 1: Weight vectors are simple
- The weight vector lives in a sub-space spanned by the examples.
- We proved this to you by analyzing the perceptron weight-update rule -- but we no longer use that rule!!! Instead we use quadratic programming.

w = Σ_l b_l x^l      (or   w = Σ_l b_l φ(x^l))
Paul Viola 1999 Machine Learning
Lemma 1: Kuhn-Tucker Conditions

minimize  w^T w   subject to   w^T x^1 ≥ 1,  w^T x^2 ≥ 1,  w^T x^3 ≥ 1

At the solution only some of the constraints are active, e.g.
w^T x^1 = 1,  w^T x^2 = 1    ->    w = b_1 x^1 + b_2 x^2

Some of the examples do not contribute -- the inactive constraints.
Paul Viola 1999 Machine Learning
SVM versus Perceptron
- Why not just use a perceptron?
  - Use all training points as centers:   y(x) = Σ_i w_i K(x^i, x) c^i
  - Update using the perceptron rule:     w_i <- w_i + η ( t - y(x) ) K(x, c^i)
- Perceptrons do not maximize the margin; the estimated function is not terribly smooth.
- Perceptrons do not rely on very few support vectors, which is what yields a much more efficient classifier.
Paul Viola 1999 Machine Learning
Support Vectors
Paul Viola 1999 Machine Learning
SVM: Difficulties
- How do you pick the kernels? Intuition / luck / magic / ...
- What if the data is not linearly separable? (i.e. the constraints cannot be satisfied)
  Add slack variables s_j:
  Σ_i b_i K(x^i, x^j) c^j ≥ 1 - s_j,      minimize  b^T b + c Σ_j s_j
Paul Viola 1999 Machine Learning
Optimal dividing line = Support Vectors

With the constraints   w^T x^i ≤ -1 (negatives),   w^T x^i ≥ +1 (positives):

d_+ = 1 / ||w||,    d_- = -1 / ||w||

margin = d_+ - d_- = 2 / ||w||,   so maximizing the margin  <=>  minimizing  w^T w
Paul Viola 1999 Machine Learning
Kernel Networks are Good for Regression
- The form of the kernel determines the form of the final function:
  - Polynomial kernels -> polynomial function
  - Gaussian kernels -> sum of Gaussians
- The common error criterion is squared error:

y(x) = Σ_i b_i K(x, x^i)
Error = Σ_j ( y(x^j) - t^j )^2

This ends up being exactly like polynomial fitting, except that there is one weight per data point.
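A minimal Gaussian-kernel regression sketch in this spirit (the kernel width, the small ridge term that keeps the linear system well conditioned, and the data are illustrative assumptions):

x = (1:10)'/10;
t = 0.5 + 0.4*sin(2*pi*x) + 0.1*randn(size(x));
sigma = 0.15;  lambda = 1e-6;
K = exp(-(repmat(x,1,numel(x)) - repmat(x',numel(x),1)).^2 / (2*sigma^2));
b = (K + lambda*eye(numel(x))) \ t;        % one weight per data point
xs = linspace(0, 1, 200)';
Ks = exp(-(repmat(xs,1,numel(x)) - repmat(x',numel(xs),1)).^2 / (2*sigma^2));
plot(x, t, 'o', xs, Ks*b)                  % data and the fitted kernel expansion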
Paul Viola 1999 Machine Learning
Radial Basis Function Networks
- When we restrict ourselves to kernels which are radially symmetric, the resulting network is called a Radial Basis Function network: K only depends on the radial distance from some datapoint c.
- Poggio & Girosi pioneered the use of these.

K(x, c) = K( |x - c| ),       y(x) = Σ_i b_i K( |x - c^i| )
Paul Viola 1999 Machine Learning
From Smoothness to Kernels: Assumptions are Necessary

[Plots: a few data points with question marks in between -- what should the function do there? Intuition.]
Paul Viola 1999 Machine Learning
Setting up the problem

W : a selector matrix with one row per data point (a single 1 at that point's grid position)
T : the targets, e.g. T = (1, 3, 1)^T at grid positions 1, 5, 10

Cost = ( W Y - T )^2,     Y = (W^T W)^{-1} W^T T     -- but W^T W is not invertible!
Paul Viola 1999 Machine Learning
Conditioning the Problem

Y = ( W^T W + λ I )^{-1} W^T T
Cost = ( W Y - T )^2 + λ Y^T Y,    where   Y^T Y = Σ_i y_i^2

Small solution vectors are best.
Y ≈ (1, 0, 0, 0, 3, 0, 0, 0, 0, 1)      [Plot: the resulting spiky solution]
Paul Viola 1999 Machine Learning
...and the winner is?

[Plots: three candidate solutions with very different smoothness]

This is not always true -- remember to think like a Bayesian.
Paul Viola 1999 Machine Learning
Smooth is Good: Regularization
- An alternative way to motivate kernel networks:

Cost(y) = Error(y) + λ Smoothness(y)
        = Σ_j ( y(x^j) - t^j )^2 + λ ∫ ( ∂y/∂x )^2 dx
Paul Viola 1999 Machine Learning
Need to find lambda

λ = 0.001:  Y ≈ (1.00, 1.50, 2.00, 2.50, 3.00, 2.60, 2.20, 1.80, 1.40, 1.00)   -- follows the data (range ≈ 1.8)
λ = 10:     Y ≈ (1.60, 1.66, 1.72, 1.78, 1.84, 1.78, 1.73, 1.67, 1.62, 1.56)   -- much flatter (range ≈ 0.3)

[Plots: the two solutions]
Paul Viola 1999 Machine Learning
Derivative Order Controls Shape

[Plots: smoothing with a first-derivative penalty vs. a second-derivative penalty]
Paul Viola 1999 Machine Learning
Still Piecewise Cubic

[Plots: the fitted function and its derivative]
Paul Viola 1999 Machine Learning
Smoothness is easily controlled

[Plots: the same fit for six increasing values of λ]
Paul Viola 1999 Machine Learning
Regularization to RBFs
- An alternative way to motivate RBFs:

E(y) = Error(y) + λ Smoothness(y)    ->   a Gaussian kernel at every training point
Paul Viola 1999 Machine Learning
Problem: Too many centers (old solutions)
- One per data point can be way too many
- Choose a random subset of the points: hope you don't get unlucky
- Distribute them based on the density of points: perhaps EM clustering
Paul Viola 1999 Machine Learning
Too Many Centers 2
- Put them where you need them, to best approximate your function
- Compute the derivative of E(y) w.r.t. the centers: this gets very hairy and does not work well (too many local minima, and no small-weight trick)
Paul Viola 1999 Machine Learning
Too Many Centers 3
- Support Vector Regression: next time.
Paul Viola 1999 Machine Learning
6.891 Machine Learning and Neural
Networks
Lecture 12:
Smooth Functions and Kernel Networks
Paul Viola 1999 Machine Learning
News
- The quiz was too hard. I am trying to come up with a creative grading scheme: best 5 out of 6 problems??? First let us do the grading.
- The problem set will be out by tonight.
Paul Viola 1999 Machine Learning
Review & Overview
- Lecture 11: trying to find smooth functions.
- Smooth regression: another way of motivating kernel networks
Paul Viola 1999 Machine Learning
...where we were last time:

[Plots: three candidate fits and their squared first derivatives; the sums are 49.7, 20, and 1.8]
Paul Viola 1999 Machine Learning
Regression Review
- Up until now we have been mostly analyzing classification: X inputs, Y classes; find the best c(x).
- Today: regression. X inputs, Y outputs; find the best f(x).
  - Predict a stock's value next week.
  - Picture of the road -> steering-wheel angle.
  - etc.

min over w of   Σ_j ( f(x^j, w) - y^j )^2
Paul Viola 1999 Machine Learning
Schemes for motivating regression
- Prior assumptions: find the best polynomial which fits the data,
  f(x, w) = w_0 + w_1 x + w_2 x^2 + ...,      min_w Σ_j ( f(x^j, w) - y^j )^2
  Or find the best neural network, or ???
- Bayesian approach: find the most likely function,
  max_f p( f | {x^j, y^j} ) = max_f p( {x^j, y^j} | f ) p(f) / p( {x^j, y^j} )
Paul Viola 1999 Machine Learning
Bayesian framework captures many approaches

max_f p( f | {x^j, y^j} ) = max_f p( {x^j, y^j} | f ) p(f) / p( {x^j, y^j} )
<=>  max_f [ log p( {x^j, y^j} | f ) + log p(f) - log p( {x^j, y^j} ) ]

With Gaussian noise:
log p( {x^j, y^j} | f ) = Σ_j log p( x^j, y^j | f ) = Σ_j log G( f(x^j) - y^j ) = c - λ Σ_j ( f(x^j) - y^j )^2

If p(f) is constant when f is a polynomial and 0 otherwise:
the polynomial that fits the data best is the most likely function.

If instead  log p(f) = -λ ∫ ( ∂f/∂x )^2 dx   (also popular:  -λ ∫ ( ∂^2 f/∂x^2 )^2 dx ):
the function which both fits the data and has a small derivative is the most likely.
Paul Viola 1999 Machine Learning
A closer look...

C(f) = Σ_j c( f(x^j) - y^j ) + λ ∫ ( ∂f/∂x )^2 dx

Data:   X = (1, 5, 10),   Y = (1, 3, 1)

- How do we minimize this function? The set of possible functions is infinite; the space of functions is infinite dimensional.
- Constrain f to be a polynomial, a sum of exponentials, etc.??
- What about unconstrained solutions...

A slightly simpler problem
- But f is not a scalar!! In fact f is more like an infinite dimensional vector.
- If f were a finite vector:  dC(f)/df = 0,  i.e.  ∂C(f)/∂f_j = 0 for every j,
  so that C() can not be reduced by adjusting any of the parameters of f.
Paul Viola 1999 Machine Learning
We could approximate f...

[Plots: piecewise approximations of f on grids of 10, 20, 100, and 1000 points]
Paul Viola 1999 Machine Learning
Looking for smooth solutions...

[Plots: the three data points with question marks in between -- intuition]
Paul Viola 1999 Machine Learning
Setting up the problem

W : selector matrix (one row per data point),   Y = (1, 3, 1)^T at grid positions 1, 5, 10

Cost = ( W F - Y )^2,     F = (W^T W)^{-1} W^T Y     -- but W^T W is not invertible!
Paul Viola 1999 Machine Learning
Conditioning the Problem

F = ( W^T W + λ I )^{-1} W^T Y
Cost = ( W F - Y )^2 + λ F^T F,    where   F^T F = Σ_i f_i^2

Small solution vectors are best.
F ≈ (1, 0, 0, 0, 3, 0, 0, 0, 0, 1)      [Plot: the resulting spiky solution]
Paul Viola 1999 Machine Learning
...and the winner is?

[Plots: three candidate solutions with very different smoothness]

This is not always true -- remember to think like a Bayesian.
Paul Viola 1999 Machine Learning
Smooth is Good: Regularization
- An alternative way to motivate kernel networks:

Cost(f) = Error(f) + λ Smoothness(f)
        = Σ_j ( f(x^j) - y^j )^2 + λ ∫ ( ∂f/∂x )^2 dx
Paul Viola 1999 Machine Learning
Setting up the Problem

W : selector matrix (one row per data point),   Y = (1, 3, 1)^T at grid positions 1, 5, 10
D : the first-difference matrix (rows of the form  ... 0 1 -1 0 ...)

Cost = ( W F - Y )^2 + λ ( D F )^2
F = ( W^T W + λ D^T D )^{-1} W^T Y

F ≈ (1.00, 1.50, 2.00, 2.50, 3.00, 2.60, 2.20, 1.80, 1.40, 1.00)
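A minimal sketch of this discrete smoothing problem (grid size, data positions, and λ follow the toy example above; treat it as illustrative):

n = 10;
pos = [1 5 10];  Y = [1; 3; 1];
W = zeros(numel(pos), n);
for k = 1:numel(pos)
    W(k, pos(k)) = 1;                      % selector matrix: one row per data point
end
D = diff(eye(n));                          % (n-1) x n first-difference matrix
lambda = 0.001;
F = (W'*W + lambda*(D'*D)) \ (W'*Y);       % regularized solution
plot(1:n, F, '-o', pos, Y, 'rx')           % smooth interpolant vs. the 3 data points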
Paul Viola 1999 Machine Learning
Need to find lambda

λ = 0.001:  F ≈ (1.00, 1.50, 2.00, 2.50, 3.00, 2.60, 2.20, 1.80, 1.40, 1.00)   -- follows the data (range ≈ 1.8)
λ = 10:     F ≈ (1.60, 1.66, 1.72, 1.78, 1.84, 1.78, 1.73, 1.67, 1.62, 1.56)   -- much flatter (range ≈ 0.3)

[Plots: the two solutions]
Paul Viola 1999 Machine Learning
Derivative Order Controls Shape

[Plots: smoothing with the penalty  ∫(∂f/∂x)^2 dx  vs.  ∫(∂^2 f/∂x^2)^2 dx]
Paul Viola 1999 Machine Learning
A Closer Look

[Plots: the first-derivative penalty gives "linear + kinks"; the second-derivative penalty gives "cubics + kinks"]
Paul Viola 1999 Machine Learning
Look at the regularizer...

Cost = ( W F - Y )^2 + λ ( D F )^2,     F = ( W^T W + λ D^T D )^{-1} W^T Y

D is the first-difference matrix; D^T D is the tridiagonal second-difference operator:

 1 -1  0  0 ...  0
-1  2 -1  0 ...  0
 0 -1  2 -1 ...  0
 .................
 0 ...  0 -1  2 -1
 0 ...  0  0 -1  1
Paul Viola 1999 Machine Learning
Second Deriv -> Fourth Deriv

If D is itself the (tridiagonal) second-difference operator, then D^T D is the pentadiagonal fourth-difference operator:

 1 -2  1  0  0 ...
-2  5 -4  1  0 ...
 1 -4  6 -4  1 ...
 0  1 -4  6 -4  1 ...
 ..................
 ...  1 -4  6 -4  1
 ...  0  1 -4  5 -2
 ...  0  0  1 -2  1
Paul Viola 1999 Machine Learning
Still Piecewise Cubic

[Plots: the fitted function and its derivative]
Paul Viola 1999 Machine Learning
Smoothness is easily controlled

[Plots: the same fit for six increasing values of λ]
Paul Viola 1999 Machine Learning
Regularization to RBFs
- An alternative way to motivate RBFs:

E(y) = Error(y) + λ Smoothness(y)    ->   a Gaussian kernel at every training point
Paul Viola 1999 Machine Learning
Problem: Too many centers (old solutions)
- One per data point can be way too many
- Choose a random subset of the points: hope you don't get unlucky
- Distribute them based on the density of points: perhaps EM clustering
Paul Viola 1999 Machine Learning
Too Many Centers 2
- Put them where you need them, to best approximate your function
- Compute the derivative of E(y) w.r.t. the centers: this gets very hairy and does not work well (too many local minima, and no small-weight trick)
Paul Viola 1999 Machine Learning
Too Many Centers 3
- Support Vector Regression

Cost(f) = Error(f) + λ Smoothness(f) = Σ_j ( f(x^j) - y^j )^2 + λ ∫ ( ∂f/∂x )^2 dx

f(x) = Σ_j w_j K(x, x^j)        -- many of the w_j's are zero!!!
Paul Viola 1999 Machine Learning
6.891 Machine Learning and Neural
Networks
Lecture 13:
Kernel Networks
on to Unsupervised Learning
Paul Viola 1999 Machine Learning
News
- The quizzes are graded: each problem has been graded; the overall score for the quiz is being determined (we ran out of time last night).
- Course grading (approximate): Psets 35%, Quiz 20%, Final 30%, Project 10%, Participation 5%
Paul Viola 1999 Machine Learning
Pset 3
You are doing spectacularly well
Paul Viola 1999 Machine Learning
Exams
Paul Viola 1999 Machine Learning
Grading alternatives
Paul Viola 1999 Machine Learning
Review & Overview
- Lecture 12: trying to find smooth functions.
  - Requiring smoothness simplifies functions: a 1st-derivative penalty gives piecewise linear, a 2nd-derivative penalty gives piecewise cubic.
- Finish off regression
- Begin unsupervised learning: PCA, etc.
Paul Viola 1999 Machine Learning
Cubics are similar...

C(f) = Σ_j ( f(x^j) - y^j )^2 + λ ∫ ( ∂^2 f/∂x^2 )^2 dx

The solution again has the kernel form   f(x) = Σ_j b_j K(x, x^j),
where the kernel satisfies   ∂^4 K(x, x_j) / ∂x^4 ∝ δ(x - x_j)   (a piecewise-cubic kernel).
Paul Viola 1999 Machine Learning
Can also get Gaussian kernels -- if you want them!!

E(y) = Error(y) + λ Smoothness(y)    ->   a Gaussian kernel at every training point (for an appropriate smoothness measure)
Paul Viola 1999 Machine Learning
Smoothness is easily controlled

[Plots: the same fit for six increasing values of λ]
Paul Viola 1999 Machine Learning
Problem: Too many centers (old solutions)
- One per data point can be way too many
- Choose a random subset of the points: hope you don't get unlucky
- Distribute them based on the density of points: perhaps EM clustering
Paul Viola 1999 Machine Learning
Too Many Centers 2
- Put them where you need them, to best approximate your function
- Compute the derivative of C(f) w.r.t. the centers: this gets very hairy and does not work well (too many local minima, and no small-weight trick)
Paul Viola 1999 Machine Learning
Support Vector Regression

f(x) = w^T x + b,      minimize  w^T w
subject to   w^T x^j + b - y^j ≤ ε    and    y^j - ( w^T x^j + b ) ≤ ε      (the usual ε-insensitive constraints)
Paul Viola 1999 Machine Learning
New Topic: Unsupervised Learning
- What can you do to understand data when you have no labels?
  - Find unusual structure in the data.
  - Find simplifications of the data.
- Find the clusters in the data: fit a mixture of Gaussians (been there, done that).
- Reduce the dimensionality of the data: find a linear projection from high to low dimensions.
- These all amount to density estimation.
- There are many other approaches: build a tree which captures the data, etc.
Paul Viola 1999 Machine Learning
8 bits per pixel  ->  0.36 bits per pixel
Paul Viola 1999 Machine Learning
6.891 Machine Learning and Neural
Networks
Lecture 14:
on to Unsupervised Learning
Paul Viola 1999 Machine Learning
News
- I will try to give you a feeling for where we are headed:
  - Next 4 lectures: Bayes nets / graphical models / Boltzmann machines / HMMs
  - After that, a series of topics (from papers).
Paul Viola 1999 Machine Learning
Review & Overview
- Lecture 13: the end of regression
- Begin unsupervised learning: PCA, etc.

New Topic: Unsupervised Learning
- What can you do to understand data when you have no labels?
  - Find unusual structure in the data. Find simplifications of the data.
- Find the clusters in the data: fit a mixture of Gaussians (been there, done that).
- Reduce the dimensionality of the data: find a linear projection from high to low dimensions.
- These all amount to density estimation.
- There are many other approaches: build a tree which captures the data, etc.
Paul Viola 1999 Machine Learning
Exploratory Data Analysis
- Machine learning is simply not that smart.
- It is still very important to look at the data.
- But when there are millions of examples and thousands of dimensions you cannot look at the data.
- It is very important to summarize the data: summarize the statistics, reduce dimensionality.

Example: Mixture of Gaussians
- Start out with many training points (1000s), distributed in a clumpy fashion.
- Replace them with a few summary clusters (10s), one for each clump.
- Why?
  - Better understand/summarize the data
    - Demographics data: a growing number of stay-at-home dads (Salary < 1000; Children > 0; Family salary > 20,000; etc.)
    - Astronomical data: an unusual distribution of stars near a galaxy
  - Extract symbols: one cluster per word (in speech), one cluster per letter (in writing)
  - Speed learning: learn with the clusters rather than all the data (approximate)
Paul Viola 1999 Machine Learning
Example of Clustering
- Can I get some examples of clustering???
- PDP??
- Andrew Moore

Speed Learning???
- Regression & kernel networks:   f(x) = Σ_j b_j K(x, x^j)
  - Learning time is cubic in the number of examples (we need to solve a linear system to find the weights).
  - Performance time is linear in the number of examples.
  - 10,000 examples -> 100 clusters??
- Problem: these clusters are not optimal.
  - Other locations may be much more effective for approximating the target function.
  - Clusters congregate near the data, not at the margin!
Paul Viola 1999 Machine Learning
Support Vector Machines:   f(x) = Σ_j b_j K(x, x^j)
- Claim to choose the optimal set of examples: 10,000 -> 100, and performance is optimal.
- But training time is still cubic in the number of examples, though some new algorithms are beginning to work more quickly.

Example: Dimensionality Reduction
- It is difficult to visualize 10, 20, 1000 dimensions.
- Worse, there is reason to believe that it can be very difficult to learn in high dimensions.
- Curse of dimensionality: as the dimensionality of the data grows, the average distance between points also grows. As a result it is difficult to get an accurate local estimate of what is going on.
Paul Viola 1999 Machine Learning
Curse of Dimensionality for Nearest Neighbor
- How far is it to your nearest neighbor??
  - An easier question: how far do you have to look before expecting 1 neighbor?
  - Assume points are distributed in a sphere of unit radius.
  - The point at the center of the sphere should have the largest number of neighbors. How far do we have to go to find 1 neighbor on average??

Vol( S_k(r) ) = c_k r^k        (volume of a radius-r sphere in k dimensions)
Paul Viola 1999 Machine Learning
Reducing Dimensionality Linearly

y = W x,   where W has fewer rows than columns.

- What is the right W?? One that preserves as much information as possible.
  Ah, but what is information??

Minimize the reconstruction error  E( (x - x̂)^2 ):  the rows of W become the leading eigenvectors e_1, e_2, ..., e_M of the data covariance.
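A minimal PCA sketch along these lines (the data matrix and the number of components M are illustrative assumptions):

X = randn(500, 10) * randn(10, 10);        % 500 examples, 10 correlated dimensions
M = 2;
mu = mean(X, 1);
Xc = X - repmat(mu, size(X,1), 1);         % centre the data
C = (Xc' * Xc) / size(X, 1);               % covariance matrix
[V, D] = eig(C);
[evals, order] = sort(diag(D), 'descend'); % sort eigenvectors by eigenvalue
W = V(:, order(1:M))';                     % rows of W = top-M eigenvectors
Y = Xc * W';                               % low-dimensional representation y = W x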
Paul Viola 1999 Machine Learning
First Eigenvector preserves more info...
Paul Viola 1999 Machine Learning
256,000 numbers  ->  4,000 numbers
Paul Viola 1999 Machine Learning
Dimensionality Reduction
Paul Viola 1999 Machine Learning
Cottrell and Metcalfe
Paul Viola 1999 Machine Learning
Information Theory for Signal
Separation
- The Cocktail Party Problem
- Many speakers -- the signals are hopelessly mixed
  [Diagram: several sound sources recorded by several microphones]
Paul Viola 1999 Machine Learning
Unmixed
Paul Viola 1999 Machine Learning
Let's look at data (figures from Christian)
[Plots: unmixed signals, mixed signals, and PCA-"unmixed" signals]
Paul Viola 1999 Machine Learning
Mathematical Assumptions
- Sound travels instantaneously
- Sound mixes linearly
- Signals are independent

S = ( s_1(t), s_2(t), s_3(t) )^T        (sources)
M = ( m_1(t), m_2(t), m_3(t) )^T        (microphone signals)
A = the 3x3 mixing matrix with entries a_11 ... a_33

M = A S
Paul Viola 1999 Machine Learning
The Unmixing Problem
- We would like to undo the mixing:   Ŝ = A^{-1} M = A^{-1} A S = S
Paul Viola 1999 Machine Learning
Reducing Dimensionality Non-Linearly

y = g( W x ),   where W has fewer rows than columns.

- What is the right W?? One that preserves as much information as possible.
  Ah, but what is information??

max over W of   MI( g(W x), x )     ->   a W with independent components
Paul Viola 1999 Machine Learning
[Plot: ICA-unmixed signals]
Paul Viola 1999 Machine Learning
Learning Rule

ΔW ∝ ( (W^T)^{-1} + (1 - 2y) x^T ) W^T W
   = ( I + (1 - 2y) (W x)^T ) W
   = ( I + (1 - 2y) u^T ) W,      where  u = W x  and  y = s(u) (sigmoid)
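A minimal sketch of this update applied to a toy 2-source mixture (the sources, the sigmoid nonlinearity, the learning rate, and the iteration count are illustrative assumptions for the infomax-style rule above):

N = 2000;
S = [sin(0.1*(1:N)); sign(randn(1,N))];      % two toy independent sources
A = [1 0.6; 0.4 1];                          % unknown mixing matrix
X = A * S;                                   % observed mixtures
W = eye(2);  eta = 0.001;
for iter = 1:200
    U = W * X;                               % current unmixed estimate
    Y = 1 ./ (1 + exp(-U));                  % sigmoid nonlinearity
    dW = (eye(2) + (1 - 2*Y) * U' / N) * W;  % natural-gradient infomax update
    W = W + eta * dW;
end
Shat = W * X;                                % recovered (scaled/permuted) sources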
Paul Viola 1999 Machine Learning
6.891 Machine Learning and Neural
Networks
Lecture 15:
Reasoning and Learning on Discrete Data
Bayes Nets
Paul Viola 1999 Machine Learning
News
- The final problem set will be ready tomorrow: mostly Bayes nets.
- Please begin to think about your final project.
Paul Viola 1999 Machine Learning
Review & Overview
- Lecture 14:
  - Principal Components Analysis: a low-dimensional projection can summarize data
  - Independent Components Analysis: an alternative to PCA which can pick out the independent sources of data
- Bayes Nets: a meeting of the minds between Artificial Intelligence and Machine Learning
  - Represent symbolic knowledge and reasoning
  - A principled mechanism for inference and learning: Bayes' rule

Artificial Intelligence
- Build systems that reason about the world:
  - Diagnosis: why won't my car start?
  - Goal-directed behavior: how can I get from here to the White House?
  - Space probe: how do I change orbit, take photos of Mars, and communicate with Earth in the next 5 minutes?
  - How can I symbolically integrate this function?
  - Game playing: how can I beat Kasparov?
- Biases:
  - Symbolic data and symbolic problems (not continuous)
  - No representation of uncertainty or probability.
Paul Viola 1999 Machine Learning
Techniques in Artificial Intelligence
- Write down a set of rules that govern the world:
  - If I get on a plane to Washington, DC then I will end up in DC.
  - If I take a taxi to Logan then I will end up at Logan.
  - If my car is out of fuel then it won't start.
  - If my starter motor is broken then it won't start.
  - The integral of sin(x) is -cos(x).
  - The integral of 2