Machine Learning Course


    6.891 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    6.891 Machine Learning

    News


    6.891 Machine Learning

    Review & Overview

    6.891 Machine Learning

    Course Information


    6.891 Machine Learning

    Grading Experiment!!!

    6.891 Machine Learning

    Course Goals


    6.891 Machine Learning

    Course Goals:

    NIPS 1989

    6.891 Machine Learning

    Course Goals:

    Pitts and McCulloch, 1947


    6.891 Machine Learning

    Goals: Analysis and Computation

    6.891 Machine Learning

    What is Machine Learning?


    6.891 Machine Learning

    Physical Laws

Newton's Measurements

[Figure: scatter plot of Newton's measurements, acceleration vs. force.]

6.891 Machine Learning

Physical Laws: Theorize

F = ma

Ignore errors & inconsistencies.

[Figure: Newton's measurements with the theorized line F = ma, acceleration vs. force.]


    6.891 Machine Learning

    Different Types of Learned Relations

F = ma,  pV = nRT

6.891 Machine Learning

Some Notation

$\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$  (feature vector)

$C \in \{C_1, C_2, \ldots\}$  (class labels)

$\mathbf{y} = (y_1, y_2, \ldots)^T$  (outputs)

$x^j$ or $\mathbf{x}^j$ : the j-th example

$C^j$ and $y^j$ : class and output for the j-th example

$t^j$ : target for the j-th example


    6.891 Machine Learning

Additional Notation (Abusive!)

$y^j = y(x^j)$,   $C^j = C(x^j)$

Empirical error:  $E = \sum_j l\left(y(x^j) - t^j\right)$

0-1 loss:  $l(z) = \begin{cases} 0 & \text{if } z = 0 \\ 1 & \text{otherwise} \end{cases}$

Squared loss:  $l(z) = z^2$

    6.891 Machine Learning

    Example: Digit Recognition


    6.891 Machine Learning

    Tremendous Variety

    6.891 Machine Learning

    Hand Labeled Data


    6.891 Machine Learning

    Final Performance

    6.891 Machine Learning

    Differentiating Speech & Music

    The Key Issue: Features!


    6.891 Machine Learning

    Speech Recognition

Sound: "Now is the time..."

    6.891 Machine Learning

    Evaluation of Credit Risk


    6.891 Machine Learning

    Digit Recognition in Detail

Candidate features (for classes 1 and 2): perimeter, width, height, number of black pixels.

    6.891 Machine Learning

    Rules for Classification

[Figure: one of the feature graphs, with classes 1 and 2.]

$C(x) = \theta(ax + b)$,  where  $\theta(y) = \begin{cases} 1 & \text{if } y \ge 0 \\ 0 & \text{otherwise} \end{cases}$


    6.891 Machine Learning

Using Bayes' Law

[Figure: feature histograms for classes "1" and "2".]

$P(C = \text{"2"} \mid F = f) = \dfrac{P(F = f \mid \text{"2"})\, P(\text{"2"})}{P(F = f)}$

In general:  $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$

    6.891 Machine Learning

    Combining Features

    x1

    x2


    6.891 Machine Learning

6.891 Machine Learning and Neural Networks


    6.891 Machine Learning

    News


    6.891 Machine Learning

    Review & Overview


    6.891 Machine Learning

    Fitting a Curve to Data

[Figure: 10 data points sampled on the interval [0, 1].]

Data: $\{x^j, t^j\}$        Find $y(x)$ = ??

You are given 10 example data points. These are samples of a physical relationship, perhaps including noise.

Your challenge is to make predictions for this relationship:

- Interpolation: between the example points.
- Extrapolation: beyond the data.

In principle there are an infinite number of functions that could be associated with this data; our challenge is to pick one.

In the final analysis we may want to hedge our bets and return a probability distribution over functions.


    6.891 Machine Learning

    Polynomial Fitting

Data: $\{x^j, t^j\}$

$y(x^j) = w_0 + w_1 x^j + \cdots + w_M (x^j)^M = \sum_{l=0}^{M} w_l\,(x^j)^l$

Weight vector: $\mathbf{w} = \{w_0, w_1, \ldots, w_M\}$

$y(x^j; \mathbf{w})$

Imagine that we are constraining ourselves to the class of polynomials.

Each M-th order polynomial is parameterized by M+1 parameters.

The learning process becomes a process by which we select values of the weights $w_l$.

The dependency of the $y(\cdot)$ function on $\mathbf{w}$ can be highlighted by the notation $y(x; \mathbf{w})$.


    6.891 Machine Learning

    Graphical Representation

$y(x^j) = \sum_{l=0}^{M} w_l\,(x^j)^l$

[Figure: a single-layer network; inputs $x^0, x^1, x^2, x^3, \ldots, x^9$ with weights $w_0, w_1, w_2, w_3, \ldots, w_9$ feeding the output $y$.]

Scalar representation vs. a high-dimensional, non-linear representation.

- While the algebraic notation for y() is clear and specific, we will see that it is sometimes also useful to develop a graphical notation for both classifiers and regression functions.

- This idea was originally popularized in the neural network literature, wherein neural networks were almost always drawn in their graphical form.

- The graphical notation points out that an intermediate representation for x is formed (M+1 exponentiations). The resulting problem is then one of learning the linear relationship between this high-dimensional space and t.


    6.891 Machine Learning

Empirical loss:  $E(\mathbf{w}) = \sum_j \mathrm{loss}\left(y(x^j; \mathbf{w}) - t^j\right)$

Find optimal weights:  $\mathbf{w}^* = \arg\min_{\mathbf{w}} E(\mathbf{w})$

Choose the Best Polynomial

- What defines the best polynomial function? Perhaps it is the one which is most consistent with the training data?

- Actually we would rather return the function which makes the best predictions on future data; unfortunately there may be no source for this data.

- For the time being let's assume that we want to find the function which best agrees with the training data: the function with the lowest loss.


    6.891 Machine Learning

    Simple Loss Functions Simplify Learning

$\mathrm{loss}(z) = z^2$

$E(\mathbf{w}) = \sum_j \left(y(x^j; \mathbf{w}) - t^j\right)^2$

$\dfrac{\partial E}{\partial w_i} = 2 \sum_j \left(y(x^j; \mathbf{w}) - t^j\right)(x^j)^i = 0$

- Certain simple loss functions lead to learning algorithms which are easy to derive and inexpensive to compute.

- For example, squared loss can be minimized by differentiating and setting the derivative to zero.

- The result is a set of linear equations that can be solved by inverting a matrix.
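As a minimal Matlab sketch of those linear equations (the variable names x, t, and M are illustrative, not from the slides), build the design matrix of powers of x and solve the normal equations:

    % x: column vector of inputs, t: column vector of targets, M: polynomial order
    X = zeros(length(x), M+1);
    for l = 0:M
        X(:, l+1) = x.^l;        % column l+1 holds x^l
    end
    w = (X' * X) \ (X' * t);     % solve the normal equations (X'X) w = X' t
    pred = X * w;                % fitted values on the training inputs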


    6.891 Machine Learning

    First order fit

[Figure: a first-order (linear) fit to the 10 data points on [0, 1].]

$E(\mathbf{w}) = \sum_j \left(y(x^j; \mathbf{w}) - t^j\right)^2$

The optimal function minimizes the residual error.

- There is a pleasant physical analogy for the squared loss. The function is connected by springs to the data. The system is then allowed to relax until the forces are balanced. The minimum-energy solution is the one that is closest to the training data.


    6.891 Machine Learning

    Fitting Different Polynomials

[Figure: polynomial fits of order 1, 2, 3, 4, 6, and 9 to the same 10 data points.]

Each order of polynomial leads to a different fit.

Higher-order polynomials come closer to the training data.

The 9th-order polynomial can fit the 10 data points perfectly.

Which of these is the most likely to generalize?


    6.891 Machine Learning

    Target Function

$h(x) = 0.5 + 0.4\,\sin(2\pi x)$

$t^j = h(x^j)$

[Figure: the target function h(x) plotted on [0, 1].]

    The function that generated the data was not a polynomial at all.


    6.891 Machine Learning

    Matlab Code

% Construct training data
train_in = [1:10]/10;
train_out = 0.5 + 0.4 * sin(2 * pi * train_in) + 0.1 * randn(size(train_in));

% Fit a polynomial
order = 3
p = polyfit(train_in, train_out, order)

% Construct a test set
test_in = [1:300]/300;
true_out = 0.5 + 0.4 * sin(2 * pi * test_in);

% Compute the polynomial prediction
fit_out = polyval(p, test_in);

% Plot the results
% first: training data, second: test data, third: predictions
plot(train_in, train_out, 'o', test_in, true_out, test_in, fit_out)
axis([0 1 -0.1 1.1])

Above is all the code used to generate the previous graphs.

As we can see, Matlab allows us to explore issues in machine learning without much hacking.


    6.891 Machine Learning

    First General Problem in Learning

There are a few general (grand challenge) problems in machine learning. Perhaps the most important is the problem of controlling complexity. We have seen that a simpler approximator can fit an unknown function better than a more complex approximator. Building a theory for this is one grand challenge of learning.

Clearly this problem has been appreciated for a long time.

Vapnik's statement is perhaps the most confusing. When he says "theory" what he means is something like a learning algorithm. The learning algorithm which fits 9th-order polynomials to 10 data points is not falsifiable.

- No set of 10 data points would fail to be fit perfectly by this algorithm.


    6.891 Machine Learning

    Overfitting in Classification

This is not to say that such problems are unique to regression. Determining decision boundaries for classification is very similar.

We need to balance the complexity of the boundary against the accuracy on the training data.


    6.891 Machine Learning

    Recall the probabilistic approach

    P(F|C1) P(F|C2)


    6.891 Machine Learning

    Probabilistic Approach

    Thomas Bayes1702-1761

$P(C = C_k \mid X = x) = \dfrac{P(X = x \mid C = C_k)\, P(C = C_k)}{P(X = x)}$

$P(C_k \mid x) = \dfrac{P(x \mid C_k)\, P(C_k)}{P(x)}$

Joint:  $P(C = C_k,\, X = x)$

Class-conditionals:  $P(X \mid C_1)$ and $P(X \mid C_2)$


    6.891 Machine Learning

    Probability Densities

$P(X \in [a, b]) = \int_a^b p(X = x)\, dx$

$p(x) \equiv p_X(x) = p(X = x)$

$P(C = C_k \mid X = x) = \dfrac{p(X = x \mid C = C_k)\, P(C = C_k)}{p(X = x)}$

$P(C_k \mid x) = \dfrac{p(x \mid C_k)\, P(C_k)}{p(x)}$

$p(X = x)\, dx = P(X \in [x, x + dx])$

Probability densities are necessary because, for a continuous random variable, the probability of any single value is zero.

The density measures the slope of the cumulative distribution function.

Alternatively, it is the probability per unit area (or length, or volume) measured over an infinitesimal region.

Somewhat surprisingly, the density is used in the same way that the distribution function is used. In other words, the probability distribution of the class given the feature value can be found using the densities of the features.


    6.891 Machine Learning

    Bayes Law for Densities

$C(x) = \begin{cases} 1 & \text{if } P(C_1 \mid x) > P(C_2 \mid x) \\ 2 & \text{otherwise} \end{cases}$

Duda & Hart, 1973

[Figure: class-conditional densities for Class 1 and Class 2.]

Just as before we can graph the conditional probability of class given feature. The functions are now continuous.

Given the Bayes classification rule, a set of decision regions is defined.
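As a minimal sketch of this rule in Matlab (the means, standard deviations, and priors below are made-up illustrative numbers, not values from the slides), compute the posterior under two Gaussian class-conditionals and pick the larger:

    % Illustrative class-conditional Gaussians and priors (made-up numbers)
    mu = [0.16, 2.2];  sigma = [0.8, 1.0];  prior = [0.5, 0.5];
    x = 1.0;                                                        % a feature value to classify
    lik = exp(-(x - mu).^2 ./ (2*sigma.^2)) ./ (sqrt(2*pi)*sigma);  % p(x | C_k)
    post = lik .* prior / sum(lik .* prior);                        % P(C_k | x) by Bayes' law
    [maxpost, C] = max(post);                                       % decide the class with the larger posterior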


    6.891 Machine Learning

    Decisions

    These analyses are easily generalized to:

    - Multiple classes

    - Multiple dimensions


    6.891 Machine Learning

    Probabilistic Classification Review

    P(F|C) & P(C) -> P(F,C)


    6.891 Machine Learning

    Information Retrieval


    6.891 Machine Learning

    Keyword Search Works Well


    6.891 Machine Learning

Naïve Bayes Classifier

$P(F_i \mid C_j)$ : probability of word $i$ appearing in a doc from class $j$

$f_i(\mathrm{Doc}_j) = 1$ if Doc $j$ contains word $i$ (0 otherwise)

$P(\{f_i\} \mid C_j) = \prod_i P(F_i = f_i \mid C_j)$

$P(C_j \mid \{f_i\}) = \dfrac{P(C_j)\, \prod_i P(F_i = f_i \mid C_j)}{P(\{f_i\})}$
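To make the product concrete, here is a minimal Matlab sketch that scores one document against each class using log-probabilities (wordprob, prior, and f are hypothetical variables, not defined in the slides; wordprob entries must lie strictly between 0 and 1):

    % wordprob(i,j): estimate of P(F_i = 1 | C_j);  prior(j): P(C_j)
    % f: 0/1 column vector, f(i) = 1 if the document contains word i
    logpost = log(prior);
    for j = 1:length(prior)
        logpost(j) = logpost(j) + sum( f .* log(wordprob(:,j)) ...
                                     + (1 - f) .* log(1 - wordprob(:,j)) );
    end
    [maxval, c] = max(logpost);     % classify into the highest-scoring class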


    6.891 Machine Learning

$P(F_i) = \dfrac{\#\{\text{docs containing word } i\}}{\#\{\text{docs}\}}$

$P(F_i \mid C_j) = \dfrac{\#\{\text{training docs in class } j \text{ with word } i\}}{\#\{\text{training docs in class } j\}}$

    Potential Bug:

    None of our Training Docs contain Mercedes

    Estimating Probabilities
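One common fix for this zero-count bug is add-one (Laplace) smoothing; the slides develop the full Bayesian version later, so the sketch below is only an illustrative assumption, with counts and ndocs as hypothetical variables:

    % counts(i,j): number of training docs of class j containing word i
    % ndocs(j):   number of training docs of class j
    wordprob = zeros(size(counts));
    for j = 1:length(ndocs)
        % add-one smoothing: pretend each word was seen once more in each class
        wordprob(:,j) = (counts(:,j) + 1) / (ndocs(j) + 2);
    end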


    6.891 Machine Learning

    Density Estimation is Ambiguous


    6.891 Machine Learning

    Impacts Classification


    Machine Learning

    Review & Overview

    Machine Learning

    Keyword Search Works Well


    Machine Learning

    Bayesian Text Classification

$P(F_i \mid C_j)$ : probability of word $i$ appearing in a doc from class $j$

Assume independence:

$P(F_1 = f_1, F_2 = f_2, \ldots, F_N = f_N \mid C = c_j) = P(f_1, f_2, \ldots \mid C = c_j) = \prod_i P(F_i = f_i \mid C = c_j)$

$P(F_i = 1 \mid C = c_j) = p_{ij}$

$d$ : a collection of documents

$W_i(d_k) = \begin{cases} 1 & \text{if document } d_k \text{ contains word } i \\ 0 & \text{otherwise} \end{cases}$

    Machine Learning

    Bayes Nets Show Dependencies

$P(\{f_i\} \mid C_j) = \prod_i P(F_i = f_i \mid C_j)$

[Figure: a Bayes net with class node C pointing to feature nodes F1, F2, F3, F4, ..., FN.]

$P(F_1 \mid C)$

Bayes nets show the dependencies between random variables.


    Machine Learning

    Classification Using Bayes Law

$P(c_j \mid \{f_i\}) = \dfrac{P(c_j)\, \prod_i P(f_i \mid c_j)}{P(\{f_i\})}$

$c = 1$ : German Cars        $c = 0$ : Other Documents

    Machine Learning

    Estimating Probability Distributions

$P(F_i = 1 \mid C = c_j) = p_{ij}$

$d$ : a collection of documents

$W_i(d_k) = \begin{cases} 1 & \text{if document } d_k \text{ contains word } i \\ 0 & \text{otherwise} \end{cases}$

How do we estimate $p_{ij}$?


    Machine Learning

    Maximum Likelihood

$P(d_k \mid c_0) = P(\{f_i^k\} \mid c_0) = \prod_i P(F_i = f_i^k \mid c_0) = \prod_i p_{i0}^{\,f_i^k}\,(1 - p_{i0})^{1 - f_i^k}$

$\prod_k P(d_k \mid c_0) = \prod_i p_{i0}^{\,n_i}\,(1 - p_{i0})^{N - n_i}$

where $n_i$ is the number of class-0 training docs containing word $i$ and $N$ is the number of class-0 training docs.

    Machine Learning

    Log Likelihood

$L = \log \prod_i p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i} = \sum_i \left[\, n_i \log p_{ij} + (N - n_i)\log(1 - p_{ij}) \,\right]$

$\dfrac{\partial L}{\partial p_{ij}} = \dfrac{n_i}{p_{ij}} - \dfrac{N - n_i}{1 - p_{ij}} = 0$


    Machine Learning

    Maximum Likelihood

$\dfrac{n_i}{p_{ij}} = \dfrac{N - n_i}{1 - p_{ij}}
\;\Longrightarrow\;
n_i\,(1 - p_{ij}) = (N - n_i)\, p_{ij}
\;\Longrightarrow\;
n_i = N\, p_{ij}$

$p_{ij} = \dfrac{n_i}{N}$

    Machine Learning

$P(F_i = 1) = \dfrac{\#\{\text{docs containing word } i\}}{\#\{\text{docs}\}}$

$P(F_i = 1 \mid C_j) = \dfrac{\#\{\text{training docs in class } j \text{ with word } i\}}{\#\{\text{training docs in class } j\}}$

    Potential Bug:

    None of our Training Docs contain Mercedes

    Estimating Probabilities


    Machine Learning

    Prior Expectations

    p(mercedes | GermanCars)?

    Machine Learning

    Bayesian Parameter Estimation

Maximum A Posteriori:

$P(p_{ij} \mid \{d_k\}, c_0) = \dfrac{P(\{d_k\} \mid p_{ij}, c_0)\; p(p_{ij})}{P(\{d_k\} \mid c_0)}$

Maximum Likelihood:

$\max_{p_{ij}}\; P(\{d_k\} \mid p_{ij}, c_0)$

This turns out to be more useful for continuous parameters.


    Machine Learning

    What is the right prior?

$\max_{p_{ij}} P(p_{ij} \mid \{d_k\}, c_0) = \max_{p_{ij}} P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij})$

With a flat prior $p(p_{ij})$ this reduces to Maximum Likelihood:  $\max_{p_{ij}} P(\{d_k\} \mid p_{ij}, c_0)$

$P(\{d_k\} \mid p_{ij}, c_0) = p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i}$

Machine Learning

Probability of the parameters

$P(p_{ij} \mid \{d_k\}, c_0) \propto P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij}) = p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i}\, p(p_{ij})$


    Machine Learning

    Bayesian Estimation

$P(p_{ij} \mid \{d_k\}, c_0) = \dfrac{P(\{d_k\} \mid p_{ij}, c_0)\; p(p_{ij})}{P(\{d_k\} \mid c_0)}
\qquad
P(F_i = 1 \mid c_j, p_{ij}) = p_{ij}$

$P(F_i = 1 \mid c_j) = \int P(F_i = 1 \mid p_{ij}, c_j)\; p(p_{ij} \mid \{d_k\}, c_j)\, dp_{ij}
= \int p_{ij}\; \dfrac{P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij})}{P(\{d_k\} \mid c_0)}\, dp_{ij}$

    Machine Learning

    Continued

$P(F_i = 1 \mid c_j)
= \dfrac{\int p_{ij}\, P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij})\, dp_{ij}}
        {\int P(\{d_k\} \mid p_{ij}, c_0)\, p(p_{ij})\, dp_{ij}}
= \dfrac{\int p_{ij}\; p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i}\, dp_{ij}}
        {\int p_{ij}^{\,n_i}\,(1 - p_{ij})^{N - n_i}\, dp_{ij}}$

(the last step assumes a flat prior $p(p_{ij})$)
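For completeness (this closing step is not shown on the slide), the two Beta integrals evaluate in closed form and give the familiar add-one estimate:

$\int_0^1 p^{a}(1-p)^{b}\,dp = \dfrac{a!\,b!}{(a+b+1)!}
\quad\Longrightarrow\quad
P(F_i = 1 \mid c_j) = \dfrac{(n_i + 1)!\,(N - n_i)!\,/\,(N + 2)!}{n_i!\,(N - n_i)!\,/\,(N + 1)!} = \dfrac{n_i + 1}{N + 2}$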


    Machine Learning

6.891 Machine Learning and Neural Networks

    Machine Learning

    News


    Machine Learning

    Review & Overview

    Machine Learning

    Why Gaussians ?


    Machine Learning

    Recall: Bayes Decision Boundaries

    Machine Learning

Discriminant Function


    Machine Learning

    Set Discriminants Equal

    Machine Learning


    Machine Learning

    Bayesian Parameter Estimation

    Machine Learning

    Convergence of Probability


    Machine Learning

    Reminder: Why we are here

[Figure: two samples of 20 points each plotted on the line from -2 to 5; one sample spans roughly 0.18 to 4.08, the other roughly -1.27 to 1.40.]

    Machine Learning

    Max Likelihood Gaussian

    Mean: 0.16

    StDev: 0.8

    Mean: 2.2

    StDev: 1.0


    Machine Learning

    Different Samples, Different Decisions

Concept: Variance

The variation you observe when training on different independent training sets.

    Machine Learning

    Variance depends on data set size...

    20 points 2000 points


    Machine Learning

    But when data gets more complex...

    Machine Learning

Gaussians don't work well

Concept: Training Error

The error of your classifier on the training set.


    Paul Viola 1999 Machine Learning 3

    Review & Overview

    Paul Viola 1999 Machine Learning 4

Histogram

[Figure: the 20 sample points binned into a histogram on [-1.5, 1.5]; bin counts on the vertical axis.]

Divide by N to yield a probability.


    Paul Viola 1999 Machine Learning 5

    Simple Algorithm

function counts = myhist(data, centers)
% Initialize counts
counts = zeros(size(centers));
numdata = size(data, 1);
% For each datapoint compute distance to every center
for i = 1:numdata
    diffs = data(i,1) - centers;
    dists = diffs.^2;
    [minval, mindist] = min(dists);          % nearest center
    counts(mindist) = counts(mindist) + 1;   % increment its bin
end
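A minimal usage sketch (the sample data and bin centers below are made-up):

    data = randn(20, 1);                 % 20 one-dimensional sample points
    centers = -1.5:0.25:1.5;             % bin centers
    counts = myhist(data, centers);
    bar(centers, counts / length(data))  % divide by N to yield a probability, as on the slide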

    Paul Viola 1999 Machine Learning 6

    Histogram


    Paul Viola 1999 Machine Learning 7

    Max Likelihood Gaussian

    Mean: 0.16

    StDev: 0.8

    Mean: 2.2

    StDev: 1.0

    Paul Viola 1999 Machine Learning 8

    Histogram Flexibility is Adjustable

[Figure: six histograms of the same sample with different numbers of bins; histogram flexibility is adjustable.]


    Paul Viola 1999 Machine Learning 9

    Histograms have lower bias

    Paul Viola 1999 Machine Learning 10

    but higher variance.


    Paul Viola 1999 Machine Learning 11

    Parzen: One Bump per Data Point

[Figure: a Parzen density estimate on [-1.5, 1.5]; one Gaussian bump per data point, summed.]

    Paul Viola 1999 Machine Learning 12

    Parzen Algorithm

function [func, range] = parzen(data, sigma)
% splitrange and gauss are course helper functions (not Matlab built-ins):
% splitrange(a, b, n) presumably returns n evenly spaced points between a and b,
% and gauss(x, mu, sigma) evaluates the Gaussian density at x.
range = splitrange(min(data), max(data), 500);
numdata = size(data, 1);
% For each point on range compute distance to every datapoint.
for i = 1:size(range, 2)
    gaussvals = gauss(range(i) - data, 0, sigma);
    func(i) = sum(gaussvals)/numdata;
end
plot(range, func)
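For a self-contained version that runs without the course helpers, here is a sketch with made-up data and an explicit Gaussian kernel (sigma is the kernel width):

    data = randn(20, 1);  sigma = 0.25;
    range = linspace(min(data) - 3*sigma, max(data) + 3*sigma, 500);
    func = zeros(size(range));
    for i = 1:length(range)
        d = range(i) - data;                                              % distances to every datapoint
        func(i) = sum(exp(-d.^2/(2*sigma^2)) / (sqrt(2*pi)*sigma)) / length(data);
    end
    plot(range, func)     % the estimate integrates to (approximately) 1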


    Paul Viola 1999 Machine Learning 13

    Parzen and Histogram are Similar

    Paul Viola 1999 Machine Learning 14

    All Three at Once

[Figure: histogram, Parzen estimate, and maximum-likelihood Gaussian overlaid on the same sample.]


    Paul Viola 1999 Machine Learning 15

    Properties of Non-parametric Techniques

    Paul Viola 1999 Machine Learning 16

    Semi-Parametric Models


    Paul Viola 1999 Machine Learning 17

[Figure: a mixture model; flip a coin, then sample $x^j$ from Gaussian($\mu_1, \sigma_1$) or Gaussian($\mu_2, \sigma_2$).]

$p(x \mid \mu_1, \sigma_1)$,   $p(x \mid \mu_2, \sigma_2)$

$P(k = 1) + P(k = 2) = 1$

$p(x) = ???$

    Paul Viola 1999 Machine Learning 18

    Events are Disjoint -> They add

$p(X = x) = p(X = x, J = 1) + p(X = x, J = 2)$
$\qquad = p(X = x \mid J = 1)\,P(J = 1) + p(X = x \mid J = 2)\,P(J = 2)$
$\qquad = p(x \mid \mu_1, \sigma_1)\,P(J = 1) + p(x \mid \mu_2, \sigma_2)\,P(J = 2)$


    Paul Viola 1999 Machine Learning 19

    Face Detection

    Paul Viola 1999 Machine Learning 20

    Generating Training Data

    Sung &

    Poggio


    Paul Viola 1999 Machine Learning 21

    Results

    But, it takes minutes per image

    Paul Viola 1999 Machine Learning 22

    Face Detection


    Paul Viola 1999 Machine Learning 23

    Events are Disjoint -> They add

$p(X = x) = p(X = x, J = 1) + p(X = x, J = 2)$
$\qquad = p(X = x \mid J = 1)\,P(J = 1) + p(X = x \mid J = 2)\,P(J = 2)$
$\qquad = p(x \mid \mu_1, \sigma_1)\,P(J = 1) + p(x \mid \mu_2, \sigma_2)\,P(J = 2)$

    Paul Viola 1999 Machine Learning 24

    Expectation Maximization

Update rules (using the responsibilities $P(k \mid x^j)$ from the current model):

$\mu_k = \dfrac{\sum_j P(k \mid x^j)\, x^j}{\sum_j P(k \mid x^j)}$

$\sigma_k^2 = \dfrac{\sum_j P(k \mid x^j)\, (x^j - \mu_k)^2}{\sum_j P(k \mid x^j)}$

$P(k) = \dfrac{1}{N}\sum_j P(k \mid x^j)$

$E = \log l(\{\mu_k, \sigma_k, q_k\})$

Bounded Below?

Decreases?
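A minimal Matlab sketch of these updates for two 1-D Gaussians (the data and initial guesses are made-up; this is not the course's code):

    % x: column of 1-D data; two-component Gaussian mixture
    x  = [randn(20,1); 2.2 + randn(20,1)];
    mu = [0, 2];  sg = [1, 1];  pk = [0.5, 0.5];
    lik = zeros(length(x), 2);
    for iter = 1:50
        % E-step: responsibilities P(k | x^j) under the current parameters
        for k = 1:2
            lik(:,k) = pk(k) * exp(-(x - mu(k)).^2 / (2*sg(k)^2)) / (sqrt(2*pi)*sg(k));
        end
        r = lik ./ repmat(sum(lik, 2), 1, 2);
        % M-step: the update equations from the slide
        nk = sum(r, 1);
        for k = 1:2
            mu(k) = sum(r(:,k) .* x) / nk(k);
            sg(k) = sqrt(sum(r(:,k) .* (x - mu(k)).^2) / nk(k));
        end
        pk = nk / length(x);
    end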


    Paul Viola 1999 Machine Learning

    News

    Paul Viola 1999 Machine Learning

    Distribution of Grades: Pset 1


    Paul Viola 1999 Machine Learning

    Review & Overview

    Paul Viola 1999 Machine Learning

    Where are we?


    Paul Viola 1999 Machine Learning

    Between density and classification

    Paul Viola 1999 Machine Learning

    Gaussians vs. Discriminants

Two Gaussian classes: $N(\mu_1, \Sigma)$ and $N(\mu_2, \Sigma)$

$y(x) = \mathbf{w}^T x + w_0$

Two-class Gaussian, same covariance.


    Paul Viola 1999 Machine Learning

    Linear Discriminant

$y(x) = \mathbf{w}^T x + w_0$

[Figure: a linear unit; inputs $x_0, x_1, x_2, \ldots, x_d$ with weights $w_0, w_1, w_2, \ldots, w_d$ feeding the output $y$.]

$y(x) = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T x$

Bias warning: here $x_0 \equiv 1$, so $w_0$ plays the role of the bias term.

    Paul Viola 1999 Machine Learning

    Multiple Discriminants

$y_1(x) = \mathbf{w}_1^T x + w_{10}
\qquad
y_2(x) = \mathbf{w}_2^T x + w_{20}$

The boundary between the two classes is where the discriminants are equal:

$y_1(x) = y_2(x)
\;\Longrightarrow\; \mathbf{w}_1^T x + w_{10} = \mathbf{w}_2^T x + w_{20}
\;\Longrightarrow\; (\mathbf{w}_1 - \mathbf{w}_2)^T x + (w_{10} - w_{20}) = 0
\;\Longrightarrow\; \mathbf{w}^T x + w_0 = 0$


    Paul Viola 1999 Machine Learning

    in a single network

$y_k(x) = \sum_i w_{ki}\, x_i + w_{k0}$

$w_{ki}$ : weight matrix

$C(x) = C_i \text{ if } i = \arg\max_k\, y_k(x)$

    Paul Viola 1999 Machine Learning

    Multiple Discriminants

    Intersection

    of Half Planes


    Paul Viola 1999 Machine Learning

    Quadratic cost is very simple

$E(W) = \sum_j \left(y(x^j) - t^j\right)^2 = \sum_j \left(W^T x^j - t^j\right)^2$

In matrix form:

$E(W) = (XW - T)^T (XW - T) = W^T X^T X W - 2\, W^T X^T T + T^T T$

$\dfrac{dE(W)}{dW} = 2\, X^T X W - 2\, X^T T = 0$

$W = (X^T X)^{-1} X^T T$
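A minimal Matlab sketch of this closed-form solution (X and T are hypothetical names: one example per row of X, one target per row of T):

    % X: N-by-(d+1) design matrix (first column all ones for the bias term)
    % T: N-by-1 vector of targets
    W = (X' * X) \ (X' * T);    % solves the normal equations above
    pred = X * W;               % predictions on the training examples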

    Paul Viola 1999 Machine Learning

    Is this a model for the brain?

    Pitts and McCulloch, 1947


    Paul Viola 1999 Machine Learning

Can't Always Solve for the Weights

$y(x) = g(\mathbf{w}^T x)$

$g(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ 0 & \text{otherwise} \end{cases}$
  or, with targets in $\{-1, +1\}$:
$g(a) = \begin{cases} 1 & \text{if } a \ge 0 \\ -1 & \text{otherwise} \end{cases}$

    Paul Viola 1999 Machine Learning

    Perceptron


    Paul Viola 1999 Machine Learning

    Perceptron Cost Function

Squared error with a hard threshold:

$E(\mathbf{w}) = \sum_j \left(g(\mathbf{w}^T x^j) - t^j\right)^2$

$\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = 2 \sum_j \left(g(\mathbf{w}^T x^j) - t^j\right) \dfrac{\partial g(\mathbf{w}^T x^j)}{\partial \mathbf{w}}$

Simple gradient descent does not work: $g(\cdot)$ is a step function, so its derivative is zero (or undefined) everywhere.

Perceptron Criterion:

$E(\mathbf{w}) = \sum_j \left(g(\mathbf{w}^T x^j) - t^j\right)(\mathbf{w}^T x^j) \;\propto\; -\!\!\sum_{j \in \text{errors}} t^j\,(\mathbf{w}^T x^j)$

$\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} \;\propto\; -\!\!\sum_{j \in \text{errors}} t^j x^j$

Update rule:  $\mathbf{w}_t = \mathbf{w}_{t-1} + \eta\, t^j x^j$  (for misclassified examples)

    Paul Viola 1999 Machine Learning

    Different Error Measures


    Paul Viola 1999 Machine Learning

    Perceptron Learning

[Figure: a perceptron; inputs $x_0, x_1, x_2, \ldots, x_d$ with weights $w_0, w_1, w_2, \ldots, w_d$ feeding the output $y$.]

    Paul Viola 1999 Machine Learning

    Real Perceptrons


    Paul Viola 1999 Machine Learning

    Paul Viola 1999 Machine Learning

    A classic problem...

[Figure: a scatter of x's and o's that no single line can separate.]


    Paul Viola 1999 Machine Learning

6.891 Machine Learning and Neural Networks

Lecture 7:

Multi-Layer Perceptrons

Back Propagation

Paul Viola 1999 Machine Learning

News

- Pset 3 is on the web

  Includes a classifier shootout

  The mystery dataset has 20 dimensions and two classes

  Winner gets $10 of Toscanini's

- Pset 2 looks great. Many of you did a lot of work.


    Paul Viola 1999 Machine Learning

    Review & Overview

- Lecture 6: Linear Discriminants

  Perceptrons

  Training Perceptrons

- Generalized Perceptrons

- Multi-layer Perceptrons: Multi-Layer Derivatives

  Back Propagation

- Examples: NETtalk

    Paul Viola 1999 Machine Learning

    On-line learning of Perceptrons

$\mathbf{w}_t = \mathbf{w}_{t-1} - \eta\, \dfrac{\partial E^j}{\partial \mathbf{w}} = \mathbf{w}_{t-1} + \eta\, t^j x^j$

[Figure: a perceptron; inputs $x_0, x_1, x_2, \ldots, x_d$, weights $w_0, w_1, w_2, \ldots, w_d$, output $y$.]

- Pick a random example
- Observe the output / error
- Adjust the weights to reduce the error

1: Error function (criterion):  $E(\mathbf{w}) = \sum_j E^j(\mathbf{w})$

2: Update rule (a Matlab sketch of this loop follows below)
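A minimal Matlab sketch of this online loop (X, t, and eta are hypothetical: rows of X are examples with a leading 1 for the bias, t contains labels in {-1, +1}):

    % X: N-by-(d+1) examples (first column = 1), t: N-by-1 labels in {-1, +1}
    eta = 0.1;  w = zeros(size(X,2), 1);
    for pass = 1:100
        j = ceil(rand * size(X,1));            % pick a random example
        if sign(X(j,:) * w) ~= t(j)            % observe the error
            w = w + eta * t(j) * X(j,:)';      % adjust weights to reduce it
        end
    end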


    Paul Viola 1999 Machine Learning

    Different Criteria

    ( )

    =

    =

    =

    errors

    jj

    errors

    jT

    j

    jTjjT

    xtw

    wE

    txw

    txwtxwgE

    2)(

    )(

    )()()(2

    w( )2)( =j

    jjTtxE ww

    ( )

    =

    =

    j

    jj

    j

    jjjT

    x

    xtxw

    E

    2

    2)(

    ww

    jj

    tt xww = 1

    jj

    tt xtww = 1

    Linear Discriminant Perceptron

    Paul Viola 1999 Machine Learning

    Normalizing examples

    j

    tt xww = 1For errors

    only!

  • 8/9/2019 Machine Learning Course

    95/326

    4

    Paul Viola 1999 Machine Learning

    The update rule in action...

    j

    tt xww = 1

    Paul Viola 1999 Machine Learning

    Real Perceptrons

  • 8/9/2019 Machine Learning Course

    96/326

    5

    Paul Viola 1999 Machine Learning

    A classic problem...

[Figure: the same x / o scatter; not linearly separable.]

    Paul Viola 1999 Machine Learning

    Generalized Perceptron

$\mathrm{XOR}(x_1, x_2) = x_1 + x_2 - 2\,x_1 x_2, \qquad x_i \in \{0, 1\}$

$y(x) = g(\mathbf{w}^T x)$ ... can't do that (with $x = (x_1, x_2)$)!

$\phi(x) = \begin{pmatrix} x_1 \\ x_2 \\ x_1 x_2 \end{pmatrix}
\qquad
\mathbf{w} = \begin{pmatrix} 1 \\ 1 \\ -2.1 \end{pmatrix}$
  ... works great!


    Paul Viola 1999 Machine Learning

    Multiple Layers

    =W

    =

    5.2110

    0115.1

    0000

    0000

    0000

    W

    How can we learn this??X1 X2

    y

    w54

    w52 w53

    1

    w41 w42w43

    u1 u2 u3

    u4

    Paul Viola 1999 Machine Learning

1986: The Rebirth of Neural Networks

- The PDP Group had a huge impact


    Paul Viola 1999 Machine Learning

    1980s: Perhaps Gradient Descent?

$y(x) = s(\mathbf{w}^T x), \qquad s(a) = \dfrac{1}{1 + e^{-a}}$

$E(\mathbf{w}) = \sum_j \left(s(\mathbf{w}^T x^j) - t^j\right)^2$

$\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = 2 \sum_j \left(s(\mathbf{w}^T x^j) - t^j\right) \dfrac{\partial s(\mathbf{w}^T x^j)}{\partial \mathbf{w}}$

$\dfrac{\partial s(u)}{\partial \mathbf{w}} = s(u)\left(1 - s(u)\right) \dfrac{\partial u}{\partial \mathbf{w}}$

$\dfrac{\partial E(\mathbf{w})}{\partial \mathbf{w}} = 2 \sum_j \left(s(\mathbf{w}^T x^j) - t^j\right) s(\mathbf{w}^T x^j)\left(1 - s(\mathbf{w}^T x^j)\right) x^j$

    Paul Viola 1999 Machine Learning

    Sigmoid Multi-Layer Network

$y(x) = s(w_{64}\, u_4 + w_{65}\, u_5)$

$u_4 = s(w_{41} u_1 + w_{42} u_2 + w_{43} u_3), \qquad u_5 = s(w_{51} u_1 + w_{52} u_2 + w_{53} u_3)$

$y(x) = s\left(w_{64}\, s(w_{41} u_1 + w_{42} u_2 + w_{43} u_3) + w_{65}\, s(w_{51} u_1 + w_{52} u_2 + w_{53} u_3)\right)$

$E(\mathbf{w}) = \sum_j \left(y(x^j) - t^j\right)^2$, and the gradient follows by the chain rule as on the previous slide.

[Figure: a two-layer sigmoid network; inputs $x_1, x_2$ and a constant 1 feed hidden units $u_4, u_5$ through weights $w_{41}, w_{42}, w_{43}, w_{51}, w_{52}, w_{53}$, and the output $u_6 = y$ is computed through $w_{64}, w_{65}$.]


    Paul Viola 1999 Machine Learning

    Multi-Layer Conventions

$u_k = g(a_k), \qquad a_k = \sum_j w_{kj}\, u_j$

- Networks must not have loops.

- Units are ordered: $i > k$ implies $u_i$ is not an input to $u_k$.

- Compute the units in order.

[Figure: the same two-layer network, with units $u_1, \ldots, u_6$ ordered from inputs to output.]

    Paul Viola 1999 Machine Learning

    More Conventions

- If units are organized in layers, the layers can be computed in parallel (a small sketch follows below):

$\mathbf{u}^2 = g(W^{12} \mathbf{u}^1), \qquad \mathbf{u}^3 = g(W^{23} \mathbf{u}^2)$
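A minimal Matlab sketch of this layer-wise forward pass (W12, W23, x1, x2 are hypothetical names; prepending a constant 1 for the bias is an assumption about the convention):

    s = @(a) 1 ./ (1 + exp(-a));    % the sigmoid from the earlier slide
    u1 = [1; x1; x2];               % input layer, with a constant 1 for the bias
    u2 = s(W12 * u1);               % hidden layer: one matrix multiply, then the nonlinearity
    u3 = s(W23 * [1; u2]);          % output layer (again prepending the bias unit)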


    Paul Viola 1999 Machine Learning

    Paul Viola 1999 Machine Learning

  • 8/9/2019 Machine Learning Course

    103/326

    12

    Paul Viola 1999 Machine Learning

    Solving XOR (big deal?)

    Paul Viola 1999 Machine Learning

    Vision Applications (sort of)

    Ts vs. Cs

  • 8/9/2019 Machine Learning Course

    104/326

    13

    Paul Viola 1999 Machine Learning

    Very Simple Solution

    Paul Viola 1999 Machine Learning

NETtalk (1986): First Real Application

- Task: Pronounce English text (Text -> Phonemes)

- Example: "This is it."  /s/ vs. /z/

- 29 possible characters

- 26 phonemes

- 7-character window

- Structure: 203 inputs, 80 hidden, 26 outputs

- 95% accurate


    Paul Viola 1999 Machine Learning

6.891 Machine Learning and Neural Networks

Lecture 8:

Back Prop and Beyond

Paul Viola 1999 Machine Learning

News

- Mid-term will be on 10/20

  Here in this room.

  It should take about 1 hour but we will give you 1.5. Show up on time, please.

  Coverage: Psets 1, 2 and 3.

  Density estimation (parametric, semi- and non-parametric)

  Bayesian classification

  Discriminants (linear, perceptron, multi-layer)


    Paul Viola 1999 Machine Learning

    Review & Overview

- Lecture 7: Multi-Layer Derivatives

  Back Propagation

  Examples: NETtalk

- Why 6.891 is not over! Bugs with gradient descent

  Local minima

  Bias and variance: how many units?

  Variants

    Paul Viola 1999 Machine Learning

    Face Detection Network:

    General Layout

    Baluja, Rowley, and Kanade

  • 8/9/2019 Machine Learning Course

    107/326

    3

    Paul Viola 1999 Machine Learning

    Intensity Preprocessing

    Paul Viola 1999 Machine Learning

    Training Data

    Positives

    Negativ

    es

  • 8/9/2019 Machine Learning Course

    108/326

    4

    Paul Viola 1999 Machine Learning

    Performance

    Paul Viola 1999 Machine Learning

  • 8/9/2019 Machine Learning Course

    109/326

    5

    Paul Viola 1999 Machine Learning

    MLP: How Powerful?

    Paul Viola 1999 Machine Learning

    Derivatives are Cheapl Modeling Factories/Plants

    PlantInput

    Control

    Materials

    Outputs

    Products

    MLP

    Derivatives

    Train with Back Prop

    Use Derivs to Modify input to improve output

  • 8/9/2019 Machine Learning Course

    110/326

    6

    Paul Viola 1999 Machine Learning

    1990: The height of MLPs and Back Prop

    l Multi-layer perceptrons can solve anyapproximation problem (in principle) Given 3 layers

    Given and infinite number of units and weights

    l There is no direct technique for finding theweights (unlike linear discriminants)

    l Gradient descent (using Back Prop) comes todominate discussion in the Neural Net community Can you find a good set of weights quickly?

    How can you speed things up? Will you get stuck in local minima?

    l A small group in the community also worries aboutgeneralization.

    Paul Viola 1999 Machine Learning

How long till we find the min?

- Simplest case: 1 weight, quadratic error function.

[Figure: a single weight w connecting input x to output y.]

$E(w) = (wx - y)^2 = (w - 0)^2$   (for $x = 1$, $y = 0$)

$w_t = w_{t-1} - \eta\, \dfrac{\partial E}{\partial w}\bigg|_{w_{t-1}}$

With $\eta = \tfrac{1}{2}$ the minimum is reached in one step.


    Paul Viola 1999 Machine Learning

Scale the Input

- Simplest case: 1 weight, quadratic error function.

[Figure: the same single-weight network, but with the input scaled by 2.]

$E(w) = (2w - 0)^2 = 4w^2$

$w_t = w_{t-1} - \eta\, \dfrac{\partial E}{\partial w}\bigg|_{w_{t-1}}$

Now the one-step choice is $\eta = \tfrac{1}{8}$.

Hack 1: Start eta very small.

Increase it if the error decreases. More hacks coming!

    Paul Viola 1999 Machine Learning

    Multiple Weights

    0.020

    0.047

    0.049

    0.050

    Hack 2: Momentum

  • 8/9/2019 Machine Learning Course

    112/326

    8

    Paul Viola 1999 Machine Learning

    Momentum

$\Delta \mathbf{w}_t = -\eta\, \dfrac{\partial E}{\partial \mathbf{w}} + \alpha\, \Delta \mathbf{w}_{t-1},
\qquad
\mathbf{w}_t = \mathbf{w}_{t-1} + \Delta \mathbf{w}_t$
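A minimal Matlab sketch of momentum on the 1-D quadratic from the previous slide (the step size and momentum constant are made-up):

    % Gradient descent with momentum on E(w) = (2w)^2 = 4 w^2
    eta = 0.05;  alpha = 0.9;
    w = 1;  dw = 0;
    for t = 1:50
        grad = 8 * w;                   % dE/dw for E(w) = 4 w^2
        dw = -eta * grad + alpha * dw;  % momentum: reuse a fraction of the last step
        w = w + dw;
    end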

    Paul Viola 1999 Machine Learning

- Gradient descent assumes a locally linear cost function:

$E(w + \Delta) \approx E(w) + \Delta\, E'(w),
\qquad
w_t = w_{t-1} - \eta\, E'(w_{t-1})$

- Second-order techniques assume a locally quadratic one:

Second Order Techniques

$E(w + \Delta) \approx E(w) + \Delta\, E'(w) + \tfrac{1}{2}\Delta^2\, E''(w)$

For $E(w) = a w^2 + b w + c$:  $\dfrac{\partial E}{\partial w} = 2aw + b$, $\dfrac{\partial^2 E}{\partial w^2} = 2a$, and the second-order step $\Delta w = -\dfrac{E'(w)}{E''(w)}$ gives

$w_1 = w_0 - \dfrac{2a w_0 + b}{2a} = -\dfrac{b}{2a}$,

the minimum, in a single step.


    Paul Viola 1999 Machine Learning

    No Hands Across America

    Paul Viola 1999 Machine Learning

    Zip Codes

    Le Cun

  • 8/9/2019 Machine Learning Course

    116/326

    1

    Paul Viola 1999 Machine Learning

6.891 Machine Learning and Neural Networks

Lecture 9:

On to Support Vector Techniques

Paul Viola 1999 Machine Learning

News

- Final will be 12/13 at 1:30 PM

  If you have a conflicting final let us know.

- Remember that almost all the material appears in the book. Right now we are jumping back and forth between

  Chapter 5

  Chapter 6


    Paul Viola 1999 Machine Learning

Why did we need multi-layer perceptrons?

- Problems like this seem to require very complex non-linearities.

- Minsky and Papert showed that an exponential number of features is necessary to solve generic problems.

    Paul Viola 1999 Machine Learning

    Why an exponential number of features?

$\phi(x) = \left(x_1^5,\; x_1^4 x_2,\; x_1^3 x_2^2,\; x_1^2 x_2^3,\; x_1 x_2^4,\; x_2^5,\; x_1^4,\; x_1^3 x_2,\; x_1^2 x_2^2,\; \ldots\right)$

14th order???  120 features.

$\dbinom{n + k}{k} = \dfrac{(n+k)!}{n!\,k!} = O\left(n^{\min(k, n)}\right)$,  where $k$ = polynomial order and $n$ = number of variables.

N=21, k=5 --> 65,000 features
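The counts quoted on the slide can be checked directly with the binomial-coefficient formula; nchoosek is a Matlab built-in:

    nchoosek(21 + 5, 5)   % = 65780, roughly the 65,000 features quoted for n = 21, k = 5
    nchoosek(14 + 2, 2)   % = 120, the feature count for 2 variables at 14th order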


    Paul Viola 1999 Machine Learning

    MLPs vs. Perceptron

    l MLPs are incredibly hard to train Takes a long time (unpredictably long)

    Can converge to poor minima

    l MLP are hard to understand What are they really doing?

    l Perceptrons are easy to train Type of linear programming. Polynomial time.

    One minimum which is global.

    l Generalized perceptrons are easier to understand. Polynomial functions.

    Paul Viola 1999 Machine Learning

    Perceptron Training

    is Linear Programming

    lwi

    l

    ii > 0x After Normalization

    After adding bias

    Assumes no errors

    Polynomial time in the number of variables

    and in the number of constraints.

    lsw li

    l

    ii >+ 0x

    lsl > 0l

    lsmin

    What about linearly inseparable?

  • 8/9/2019 Machine Learning Course

    120/326

    5

    Paul Viola 1999 Machine Learning

    Rebirth of Perceptrons

    l How to train efficiently. Linear Programming ( later quadratic programming)

    l How to get so many features inexpensively?!?

    l How to generalize with so many features? Occams revenge.

    Support Vector Machines

    Paul Viola 1999 Machine Learning

    Lemma 1: Weight vectors are simple

    l The weight vector lives in a sub-space spanned bythe examples Dimensionality is determined by the number of examples

    not the complexity of the space.

    t

    tw x=00=w

    ==l

    l

    l

    errors

    t

    t bw xx =l

    l

    lt bw )(x

  • 8/9/2019 Machine Learning Course

    121/326

    6

    Paul Viola 1999 Machine Learning

    Lemma 2: Only need to compare examples

    Paul Viola 1999 Machine Learning

    Perceptron Rebirth: Generalizationl Too many features Occam is unhappy

    Perhaps we should encourage smoothness?

    lsKb lj

    j

    lj >+ 0),( xx

    lsl > 0

    l

    lsmin

    j

    jb2min

    Smoother

    But this is unstable!!

  • 8/9/2019 Machine Learning Course

    122/326

    7

    Paul Viola 1999 Machine Learning

    The linear could return any multiple of the correct

    weight vector...

    lwi

    l

    ii > 0 x ( ) lwi

    l

    ii > 0 x

    Linear Program is not unique

    lswl

    i

    l

    ii >+0x

    lsl > 0l

    l

    smin

    i

    iw2

    min

    Slack variables & Weight prior

    - Force the solution toward zero

    Paul Viola 1999 Machine Learning

    Definition of the Margin

    l

    Margin: Gap between negatives and positivesmeasured perpendicular to a hyperplane

  • 8/9/2019 Machine Learning Course

    123/326

  • 8/9/2019 Machine Learning Course

    124/326

  • 8/9/2019 Machine Learning Course

    125/326

    1

    Paul Viola 1999 Machine Learning

    SVM: examples

    Paul Viola 1999 Machine Learning

    SVM: Key Ideasl Augment inputs with a very large feature set

    Polynomials, etc.

    l Use Kernel Trick(TM) to do this efficiently

    l Enforce/Encourage Smoothness with weight penalty

    Minimize

    l Introduce Margin so that:

    Set of linear inequalities

    l Find best solution using Quadratic Programming

    =i

    i

    T www 2

    iwi 0

    )(x

  • 8/9/2019 Machine Learning Course

    126/326

    1

    Paul Viola 1999 Machine Learning

    SVM: Difficulties

    l How do you pick the kernels? Intuition / Luck / Magic /

    l What if the data is not linearly separable? i.e. the constraints cannot be satisfied

    j

    i

    ij

    i

    jcxKwtj

    1),()12(

    + j jT cww ||min Slack Variables

    Paul Viola 1999 Machine Learning

    SVM: Simple Example

    l

    Data dimension: 2l Feature Space: 2nd order polynomial

    4 dimensional

    6 weights

  • 8/9/2019 Machine Learning Course

    127/326

  • 8/9/2019 Machine Learning Course

    128/326

    1

    Paul Viola 1999 Machine Learning

    Zip Codes

    Much Effort spent on

    organizing the network

    Paul Viola 1999 Machine Learning

SVM: Zip Code Recognition

- Data dimension: 256

- Feature space: 4th order, roughly 100,000,000 dimensions

  • 8/9/2019 Machine Learning Course

    129/326

    1

    Paul Viola 1999 Machine Learning

    SVM: Faces

    Paul Viola 1999 Machine Learning

    Support Vectors

  • 8/9/2019 Machine Learning Course

    130/326

    1

    Paul Viola 1999 Machine Learning

6.891 Machine Learning and Neural Networks

Lecture 10:

Support Vector Machines

More Details and Derivations

Paul Viola 1999 Machine Learning

News

- Quiz is 1 week from today.

- Problem set 4 will go out right after the quiz. In one week it's a pain to do two things at once.

- Problem sets are very good (once again). On an absolute scale many of you are getting A's.


    Paul Viola 1999 Machine Learning

    Pset 2

    Paul Viola 1999 Machine Learning

    Review & Overviewl Lecture 9:

    Resurrecting Perceptrons

    Setting up Support Vector Machines

    l SVM review

    l Why is it called Support Vectors??

    l Derivation of some simpler properties.

  • 8/9/2019 Machine Learning Course

    132/326

    3

    Paul Viola 1999 Machine Learning

    SVM: Key Ideas

    l Augment inputs with a very large feature set Polynomials, etc.

    l Use Kernel Trick(TM) to do this efficiently

    l Enforce/Encourage Smoothness with weight penalty

    Minimize

    l Find best solution using Quadratic Programming

    =i

    i

    T bbb 2 ibi 0

    )(x

    Avoid!

    Paul Viola 1999 Machine Learning

    Support Vectors

    l Many of the bs are zero -- inactive constraints Only keep examples where

    l Likely to generalize well VC Dimension -- later in the semester

    1),(

    i

    ij

    i cxKbj

    constraintsubject to)min( wwT

    =

    i

    i

    i cxKbxy ),()(

    0ib

  • 8/9/2019 Machine Learning Course

    133/326

  • 8/9/2019 Machine Learning Course

    134/326

    5

    Paul Viola 1999 Machine Learning

    The optimal dividing line

    l The optimal separatormaximizes the marginbetweenpositive and negative examples

    iT

    negativesxwd max=

    iT

    positivesxwd min=+

    ( )||

    maxmarginmaxw

    dd

    ww

    +=

    ||margin

    w

    dd +=

    Paul Viola 1999 Machine Learning

    Definition of the Margin

    l

    Margin: Gap between negatives and positivesmeasured perpendicular to a hyperplane

  • 8/9/2019 Machine Learning Course

    135/326

    6

    Paul Viola 1999 Machine Learning

    Optimal dividing line=Support Vectors

    iT

    negativesxwd max=

    iT

    positivesxwd min=+

    ||max

    w

    dd

    w

    +

    1 iTnegatives

    xw

    1 iTpositives

    xw

    wwT

    min

    Paul Viola 1999 Machine Learning

    Lemma 1: Weight vectors are simple

    l The weight vector lives in a sub-space spanned bythe examples

    l Proved this to you by analyzing the perceptronweight update rule But we no longer use that rule!!!

    Instead we use Quadratic Programming

    =l

    l

    lbw x =l

    l

    lbw )(x

  • 8/9/2019 Machine Learning Course

    136/326

    7

    Paul Viola 1999 Machine Learning

    Lemma 1: Kuhn-Tucker Conditions

    1

    1

    1

    3

    2

    1

    xw

    xw

    xw

    T

    T

    T

    ( )wwTmin

    11

    2

    1

    ==

    xw

    xw

    T

    T

    2

    2

    1

    1 xxw bb +=

    Some of the examples do not contribute

    the inactive constraints.

    Paul Viola 1999 Machine Learning

    SVM versus Perceptronl Why not just use a perceptron?

    Use all training points as a centers

    Update using perceptron rule:

    l

    Perceptrons do not maximize the margin The estimated function is not terribly smooth

    l Perceptrons do not rely on very few support vectors Yields a much more efficient classifier.

    =

    i

    iT

    i cxKwxy ),()(

    ),())((i

    ii cxKxytww +=

  • 8/9/2019 Machine Learning Course

    137/326

  • 8/9/2019 Machine Learning Course

    138/326

    9

    Paul Viola 1999 Machine Learning

    Support Vectors

    Paul Viola 1999 Machine Learning

    SVM: Difficultiesl How do you pick the kernels?

    Intuition / Luck / Magic /

    l What if the data is not linearly separable? i.e. the constraints cannot be satisfied

    1),( +

    j

    i

    ij

    i scxKbj

    + j jT scbb ||min Slack Variables

  • 8/9/2019 Machine Learning Course

    139/326

  • 8/9/2019 Machine Learning Course

    140/326

  • 8/9/2019 Machine Learning Course

    141/326

  • 8/9/2019 Machine Learning Course

    142/326

  • 8/9/2019 Machine Learning Course

    143/326

    3

    Paul Viola 1999 Machine Learning

    Optimal dividing line=Support Vectors

    iT

    negativesxwd max=

    iT

    positivesxwd min=+

    ||max

    w

    dd

    w

    +

    1 iTnegatives

    xw

    1 iTpositives

    xw

    wwT

    1max

    ||

    1

    wd

    = ||

    1

    wd =+

    wwww

    ww

    w

    ddT

    1

    ||

    1

    ||

    ||

    1

    ||

    1

    || 2==

    =

    +

    wwT

    min

    Paul Viola 1999 Machine Learning

    Kernel Networks are Good for Regression

    l The form of the Kernel determines the form ofthe final function Polynomial Kernels -> Polynomial Function

    Gaussian Kernels -> Sum of Gaussians

    l The Common Error Criteria is squared error

    =

    i

    i

    i cxKbxy ),()( =i

    i

    icxKbxy ),()(

    ( ) =j

    jj txyError2

    )(

    This ends up being exactly like polynomial fitting

    except that there is one weight per data point

    This ends up being exactly like polynomial fitting

    except that there is one weight per data point

  • 8/9/2019 Machine Learning Course

    144/326

    4

    Paul Viola 1999 Machine Learning

    Radial Basis Function Networks

    l When we restrict ourselves to Kernels which areradially symmetric, the resulting network is calleda Radial Basis Function Network K only depends on the radial distance from some

    datapoint c.

    Poggio & Girosi pioneered the use of these.

    |)(|),( cxKcxK = =i

    i

    i cxKbxy |)(|)(

    Paul Viola 1999 Machine Learning

From Smoothness to Kernels:

Assumptions are Necessary

[Figure: two sparse 1-D datasets on [0, 10]; question marks indicate the values to be predicted between the observations.]

Intuition


    Paul Viola 1999 Machine Learning

    Setting up the problem

$W$ : an observation matrix with one row per measured point; each row is all zeros except a single 1 in the column of the observed sample (here points 1, 5, and 10 of a length-10 vector).

$T = (1,\; 3,\; 1)^T$ : the observed values.

$\mathrm{Cost} = (WY - T)^2$

$Y = (W^T W)^{-1} W^T T$ ... but $W^T W$ is not invertible!

    Paul Viola 1999 Machine Learning

    Conditioning the Problem

$Y = (W^T W + \lambda I)^{-1} W^T T$

$\mathrm{Cost} = (WY - T)^2 + \lambda\, Y^T Y, \qquad Y^T Y = \sum_i y_i^2$

Small solution vectors are best.

$Y = (1,\ 0,\ 0,\ 0,\ 3,\ 0,\ 0,\ 0,\ 0,\ 1), \qquad T = (1,\ 0,\ 0,\ 0,\ 3,\ 0,\ 0,\ 0,\ 0,\ 1)$

[Figure: the regularized solution plotted over the 10 positions.]
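A minimal Matlab sketch of this regularized solution (the observation pattern and lambda are made-up to mirror the slide):

    % Observations at positions 1, 5, and 10 of a length-10 vector
    W = zeros(3, 10);  W(1,1) = 1;  W(2,5) = 1;  W(3,10) = 1;
    T = [1; 3; 1];
    lambda = 0.1;
    Y = (W' * W + lambda * eye(10)) \ (W' * T);   % regularized least squares
    plot(1:10, Y, 'o-')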


    Paul Viola 1999 Machine Learning

    and the winner is?

    2 4 6 8 100

    1

    2

    3

    4

    5

    6

    2 4 6 8 100

    0.5

    1

    1.5

    2

    2.5

    3

    2 4 6 8 100

    0.5

    1

    1.5

    2

    2.5

    3

    This is not always true

    remember to think like a

    Bayesian

    Paul Viola 1999 Machine Learning

    Smooth is Good: Regularization

    l Alternative way to motivate Kernel Networks.

    ( )

    Cost y Error y Smoothness y

    y x tx

    y x dxj j

    j

    ( ) ( ) ( )

    ( )$

    ( $) $

    = +

    = +

    2

    2

  • 8/9/2019 Machine Learning Course

    147/326

  • 8/9/2019 Machine Learning Course

    148/326

    8

    Paul Viola 1999 Machine Learning

    Need to find lambda

    Y=1.0000

    1.5000 2.0000 2.5000 3.0000 2.6000 2.2000 1.8000 1.4000 1.0000

    001.0=

    2 4 6 8 100

    0.5

    1

    1.5

    2

    2.5

    3

    10=

    Y = 1.6000 1.6600

    1.7200 1.7800 1.8400 1.7840 1.7280 1.6720 1.6160 1.5600

    0.3

    2 4 6 8 100

    0.5

    1

    1.5

    2

    2.5

    3

    1.8

    Paul Viola 1999 Machine Learning

    Derivative Order Controls Shape

    1 0 2 0 30 40 500

    0. 5

    1

    1. 5

    2

    2. 5

    3

    1 0 2 0 30 40 500

    0. 5

    1

    1. 5

    2

    2. 5

    3

  • 8/9/2019 Machine Learning Course

    149/326

  • 8/9/2019 Machine Learning Course

    150/326

    10

    Paul Viola 1999 Machine Learning

    Still Piecewise Cubic

    10 20 30 40 50

    -1

    -0.8

    -0.6

    -0.4

    -0.2

    0

    0.2

    0.4

    0.6

    0.8

    10 20 30 40 50

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    Paul Viola 1999 Machine Learning

    Smoothness is easily controlled

[Figure: six reconstructions of the same data with increasing smoothness penalties; smoothness is easily controlled.]


    Paul Viola 1999 Machine Learning

    Regularization to RBFs

    l Alternative way to motivate RBFs

$E(y) = \mathrm{Error}(y) + \lambda\, \mathrm{Smoothness}(y)$

The resulting solution is a sum of Gaussian bumps, one at every training point.

    Paul Viola 1999 Machine Learning

Problem: Too many centers

(Old solutions)

- One per data point can be way too many.

- Choose a random subset of the points: hope you don't get unlucky.

- Distribute them based on the density of points: perhaps EM clustering.


    Paul Viola 1999 Machine Learning

    Too Many Centers 2

    l Put them where you need them To best approximate your function

    l Compute the derivative of E(y) w.r.t. the centers This gets very hairy and does not work well

    Too many local minima -- no small weight trick

    Paul Viola 1999 Machine Learning

    Too Many Centers 3

    l Support Vector Regression Next time.


    Paul Viola 1999 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    Lecture 12:

    Smooth Functions and Kernel Networks

    Paul Viola 1999 Machine Learning

News
    l Quiz was too hard

    I am trying to come up with a creative grading scheme.

    Best 5 out of 6 problems???

    First let us do the grading.

    l Problem set will be out by tonight.


    Paul Viola 1999 Machine Learning

    Review & Overview

    l Lecture 11: Trying to find smooth functions.

    l Smooth Regression Another way of motivating Kernel networks

    Paul Viola 1999 Machine Learning

    where we were last time.

[Plots: three candidate fits and the squared 1st derivative of each, summed over the grid:  Sum = 49.7,  Sum = 20,  Sum = 1.8]


    Paul Viola 1999 Machine Learning

    Regression Review

l Up until now we have been mostly analyzing classification:

    X, inputs. Y, classes. Find the best c(x).

    l Today: Regression.

    X, inputs. Y, outputs. Find the best f(x).

    Predict the stock's value next week.

    Picture of Road -> Car steering wheel

    etc.

min_w Σ_j ( f(x_j, w) - y_j )^2

    Paul Viola 1999 Machine Learning

Schemes for motivating regression
    l Prior assumptions

Find the best polynomial which fits the data:

        f(x, w) = w_0 + w_1 x + w_2 x^2 + …            min_w Σ_j ( f(x_j, w) - y_j )^2

    Or find the best neural network, or ???

    l Bayesian Approach: find the most likely function:

        max_f p( f | {x_j, y_j} ) = max_f  p( {x_j, y_j} | f ) p(f) / p( {x_j, y_j} )
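For the "prior assumptions" scheme above, a minimal NumPy sketch (the toy data is an assumption, purely for illustration) of fitting the best cubic under squared loss:

    import numpy as np

    x = np.linspace(0, 1, 10)                                # toy inputs
    y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)    # toy noisy outputs

    w = np.polyfit(x, y, deg=3)   # minimizes sum_j (f(x_j, w) - y_j)^2 over cubic polynomials
    f = np.poly1d(w)
    print(f(0.5))                 # prediction at a new input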


    Paul Viola 1999 Machine Learning

    Bayesian framework captures

    many approaches

max_f p( f | {x_j, y_j} ) = max_f  p( {x_j, y_j} | f ) p(f) / p( {x_j, y_j} )

    max_f  [ log p( {x_j, y_j} | f ) + log p(f) - log p( {x_j, y_j} ) ]

    log p( {x_j, y_j} | f ) = Σ_j log p( x_j, y_j | f )
                            = Σ_j log G( f(x_j) - y_j )
                            = c - Σ_j ( f(x_j) - y_j )^2

    p(f) = { constant   if f is a polynomial
           { 0          otherwise

    The polynomial that fits the data best is the most likely function.
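Spelling out the Gaussian-noise step above (a standard identity; σ is the assumed noise scale):

    log p({x_j, y_j} | f) = Σ_j log [ (1/√(2π)σ) exp( -( f(x_j) - y_j )^2 / 2σ^2 ) ]
                          = const - (1/2σ^2) Σ_j ( f(x_j) - y_j )^2

so maximizing the likelihood under Gaussian noise is exactly minimizing the squared error.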

    Paul Viola 1999 Machine Learning

    Bayesian framework captures

    many approaches

log p( {x_j, y_j} | f ) = c - Σ_j ( f(x_j) - y_j )^2

    log p(f) = -λ ∫ ( ∂f/∂x )^2 dx

    The function which both fits the data and has a small derivative is the most likely.

    log p(f) = -λ ∫ ( ∂^2 f/∂x^2 )^2 dx        Also popular


    Paul Viola 1999 Machine Learning

    A closer look...

C(f) = c Σ_j ( f(x_j) - y_j )^2  +  λ ∫ ( ∂f/∂x )^2 dx

    Data:    X = [1  5  10]^T        Y = [1  3  1]^T

    l How do we minimize this function? The set of possible functions is infinite

    The space of functions is infinite dimensional

    l Constrain f to be: polynomial, sum of exponentials, etc??

    l What about unconstrained solutions...

    Paul Viola 1999 Machine Learning

A slightly simpler problem

    l But f is not a scalar!!

    l In fact f is more like an infinite dimensional vector.

    l If f were a finite vector:

        dC(f)/df = 0            ∂C(f)/∂f_j = 0   for every j

        Cannot reduce C() by adjusting any of the parameters of f.

        C(f) = c Σ_j ( f(x_j) - y_j )^2 + λ ∫ ( ∂f/∂x )^2 dx


    Paul Viola 1999 Machine Learning

    We could approximate f .

[Plots: the function f approximated on grids of 10, 20, 100, and 1000 points]

    Paul Viola 1999 Machine Learning

Looking for smooth solutions...

    [Plots: candidate fits to the same data]

    Intuition    ? ?


    Paul Viola 1999 Machine Learning

    Setting up the problem

W = 10x10 mask matrix: a 1 on the diagonal at each observed grid point (x = 1, 5, 10), zeros everywhere else

    Y = [1 0 0 0 3 0 0 0 0 1]^T        (targets, zero where nothing was observed)

    F = (W^T W)^{-1} W^T Y

    Cost = (W F - Y)^2

    Not Invertible!

    F = [1 22 0 0 3 0 0 6 0 1]         (off the data points the values are arbitrary)

    Paul Viola 1999 Machine Learning

    Conditioning the Problem

F = (W^T W + λI)^{-1} W^T Y

    Cost = (W F - Y)^2 + λ F^T F        F^T F = Σ_i f_i^2

    Small solution vectors are best.

    F = [1 0 0 0 3 0 0 0 0 1]
    Y = [1 0 0 0 3 0 0 0 0 1]


    Paul Viola 1999 Machine Learning

    and the winner is?


This is not always true; remember to think like a Bayesian.

    Paul Viola 1999 Machine Learning

    Smooth is Good: Regularization

    l Alternative way to motivate Kernel Networks.

Cost(f) = Error(f) + λ Smoothness(f)
            = Σ_j ( f(x_j) - y_j )^2  +  λ ∫ ( ∂f/∂x )^2 dx
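On a grid the smoothness integral turns into a sum of squared first differences, which is exactly what the D matrix on the next slide implements (the grid spacing is absorbed into λ):

    ∫ ( ∂f/∂x )^2 dx  ≈  Σ_i ( f_{i+1} - f_i )^2  =  (DF)^T (DF)

where each row of D holds a 1 and a -1 on adjacent grid points.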


    Paul Viola 1999 Machine Learning

    Setting up the Problem

W = 10x10 mask matrix: a 1 on the diagonal at each observed grid point (x = 1, 5, 10), zeros everywhere else

    Y = [1 0 0 0 3 0 0 0 0 1]^T

    D = 9x10 first-difference matrix:
        [ 1 -1  0  0  0  0  0  0  0  0
          0  1 -1  0  0  0  0  0  0  0
          ...
          0  0  0  0  0  0  0  0  1 -1 ]

    F = (W^T W + λ D^T D)^{-1} W^T Y

    Cost = (W F - Y)^2 + λ (D F)^2

    F = [1.0000 1.5000 2.0000 2.5000 3.0000 2.6000 2.2000 1.8000 1.4000 1.0000]
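A minimal NumPy sketch of this regularized solve (assuming, as on the earlier slides, a 10-point grid with data at x = 1, 5, 10); with a small λ it reproduces the piecewise-linear F shown above:

    import numpy as np

    n = 10
    obs, t = [0, 4, 9], [1.0, 3.0, 1.0]

    W = np.zeros((n, n)); W[obs, obs] = 1.0          # mask matrix
    Y = np.zeros(n);      Y[obs] = t
    D = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)     # first-difference operator

    lam = 1e-3
    F = np.linalg.solve(W.T @ W + lam * D.T @ D, W.T @ Y)
    print(np.round(F, 3))    # ~[1.0 1.5 2.0 2.5 3.0 2.6 2.2 1.8 1.4 1.0]

Raising lam trades data fit for smoothness, which is exactly the choice the next slide ("Need to find lambda") illustrates.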

    Paul Viola 1999 Machine Learning

Need to find lambda

    λ = 0.001:
        F = [1.0000 1.5000 2.0000 2.5000 3.0000 2.6000 2.2000 1.8000 1.4000 1.0000]

    λ = 10:
        F = [1.6000 1.6600 1.7200 1.7800 1.8400 1.7840 1.7280 1.6720 1.6160 1.5600]


    Paul Viola 1999 Machine Learning

    Derivative Order Controls Shape

[Plots: the fit under a squared 1st-derivative penalty ( ∂f/∂x )^2 vs. a squared 2nd-derivative penalty ( ∂^2 f/∂x^2 )^2]

    Paul Viola 1999 Machine Learning

    A Closer Look

[Plots]    Linear + Kinks        Cubics + Kinks


    Paul Viola 1999 Machine Learning

    Look at the regularizer...

D = 9x10 first-difference matrix (each row: a 1 and a -1 on adjacent grid points)

    F = (W^T W + λ D^T D)^{-1} W^T Y        Cost = (W F - Y)^2 + λ (D F)^2

    D' * D =
         1 -1  0  0  0  0  0  0  0  0
        -1  2 -1  0  0  0  0  0  0  0
         0 -1  2 -1  0  0  0  0  0  0
         0  0 -1  2 -1  0  0  0  0  0
         0  0  0 -1  2 -1  0  0  0  0
         0  0  0  0 -1  2 -1  0  0  0
         0  0  0  0  0 -1  2 -1  0  0
         0  0  0  0  0  0 -1  2 -1  0
         0  0  0  0  0  0  0 -1  2 -1
         0  0  0  0  0  0  0  0 -1  1

    Paul Viola 1999 Machine Learning

    Second Deriv -> Fourth Deriv

    D =

    1 -1 0 0 0 0 0 0 0 0

    -1 2 -1 0 0 0 0 0 0 0

    0 -1 2 -1 0 0 0 0 0 0

    0 0 -1 2 -1 0 0 0 0 0

    0 0 0 -1 2 -1 0 0 0 0

    0 0 0 0 -1 2 -1 0 0 0

    0 0 0 0 0 -1 2 -1 0 0

    0 0 0 0 0 0 -1 2 -1 0

    0 0 0 0 0 0 0 -1 2 -1

    0 0 0 0 0 0 0 0 -1 1

    D' * D =

    1 -2 1 0 0 0 0 0 0 0

    -2 5 -4 1 0 0 0 0 0 0

    1 -4 6 -4 1 0 0 0 0 0

    0 1 -4 6 -4 1 0 0 0 0

    0 0 1 -4 6 -4 1 0 0 0

    0 0 0 1 -4 6 -4 1 0 0

    0 0 0 0 1 -4 6 -4 1 0

    0 0 0 0 0 1 -4 6 -4 1

    0 0 0 0 0 0 1 -4 5 -2

    0 0 0 0 0 0 0 1 -2 1
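A quick NumPy check of this composition (assuming the 10-point grid): the first print reproduces the tridiagonal matrix labelled D above, the second the pentadiagonal D' * D:

    import numpy as np

    n = 10
    D1 = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)    # first differences
    D2 = D1[:-1, :-1] @ D1                            # second differences: rows of 1 -2 1

    print((D1.T @ D1).astype(int))    # tridiagonal   ... -1  2 -1 ...
    print((D2.T @ D2).astype(int))    # pentadiagonal ...  1 -4  6 -4  1 ...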


    Paul Viola 1999 Machine Learning

    Still Piecewise Cubic


    Paul Viola 1999 Machine Learning

    Smoothness is easily controlled


    Paul Viola 1999 Machine Learning

    Regularization to RBFs

    l Alternative way to motivate RBFs

E(y) = Error(y) + Smoothness(y)

    A Gaussian at every training point.
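A minimal sketch of the resulting kernel network, assuming a Gaussian kernel with one center per training point and a tiny ridge term for numerical stability (the toy data and the width sigma = 2 are assumptions):

    import numpy as np

    def gauss(a, b, sigma=2.0):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

    x = np.array([1.0, 5.0, 10.0])    # training inputs
    y = np.array([1.0, 3.0, 1.0])     # training targets

    K = gauss(x, x)                                        # kernel matrix
    b = np.linalg.solve(K + 1e-6 * np.eye(len(x)), y)      # one weight per training point

    x_new = np.linspace(0, 11, 5)
    print(np.round(gauss(x_new, x) @ b, 2))                # f(x) = sum_j b_j K(x, x_j)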

    Paul Viola 1999 Machine Learning

Problem: Too many centers (old solutions)

    l One per data point can be way too many

    l Choose a random subset of the points. Hope you don't get unlucky.

    l Distribute them based on the density of points. Perhaps EM clustering.


    Paul Viola 1999 Machine Learning

    Too Many Centers 2

    l Put them where you need them To best approximate your function

    l Compute the derivative of E(y) w.r.t. the centers This gets very hairy and does not work well

    Too many local minima -- no small weight trick

    Paul Viola 1999 Machine Learning

Too Many Centers 3
    l Support Vector Regression

    Cost(f) = Error(f) + Smoothness(f)
            = Σ_j ( f(x_j) - y_j )^2  +  λ ∫ ( ∂f/∂x )^2 dx

    f(x) = Σ_j w_j K(x, x_j)

    Many w_j's are zero!!!


    Paul Viola 1999 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    Lecture 13:

    Kernel Networks

    on to Unsupervised Learning

    Paul Viola 1999 Machine Learning

News
    l Quizzes are graded

    Each problem has been graded.

    ** The overall score for the quiz is being determined. We ran out of time last night.

    l Course grading: (approximate) Psets: 35%

    Quiz: 20%

    Final: 30%

    Project: 10%

    Participation: 5%


    Paul Viola 1999 Machine Learning

    Pset 3

    You are doing spectacularly well

    Paul Viola 1999 Machine Learning

    Exams


    Paul Viola 1999 Machine Learning

    Grading alternatives

    Paul Viola 1999 Machine Learning

Review & Overview
    l Lecture 12:

    Trying to find smooth functions.

    Requiring smoothness simplifies functions: 1st deriv -> piecewise linear; 2nd deriv -> cubic

    l Finish off Regression

    l Begin unsupervised learning. PCA, etc


    Paul Viola 1999 Machine Learning

    Cubics are similar...

C(f) = Σ_j ( f(x_j) - y_j )^2 + λ ∫ ( ∂^2 f/∂x^2 )^2 dx

    f(x) = Σ_j b_j K(x, x_j)        with    K(x, x_j) = | x - x_j | (x - x_j)^2

    ∂^4 K(x, x_j) / ∂x^4 = δ( x - x_j )

    ∂^4 f(x) / ∂x^4 = Σ_j a_j δ( x - x_j ) = Σ_j b_j ∂^4 K(x, x_j) / ∂x^4

    Paul Viola 1999 Machine Learning

    Can also get gaussian kernels

    l if you want them!!

E(y) = Error(y) + Smoothness(y)

    A Gaussian at every training point.


    Paul Viola 1999 Machine Learning

    Smoothness is easily controlled


    Paul Viola 1999 Machine Learning

Problem: Too many centers (old solutions)

    l One per data point can be way too many

    l Choose a random subset of the points. Hope you don't get unlucky.

    l Distribute them based on the density of points. Perhaps EM clustering.


    Paul Viola 1999 Machine Learning

    Too Many Centers 2

    l Put them where you need them To best approximate your function

    l Compute the derivative of C(f) w.r.t. the centers This gets very hairy and does not work well

    Too many local minima -- no small weight trick

    Paul Viola 1999 Machine Learning

Support Vector Regression

f(x) = w^T x + b

    min_w  w^T w      subject to      ( w^T x_j + b ) - y_j ≤ ε
                                       y_j - ( w^T x_j + b ) ≤ ε
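These slides predate it, but a convenient way to try this formulation today is scikit-learn's SVR (shown purely as an illustration, not as the course's code); with a kernel it yields the sparse expansion f(x) = Σ_j w_j K(x, x_j) promised on the previous slide:

    import numpy as np
    from sklearn.svm import SVR

    x = np.linspace(0, 10, 50).reshape(-1, 1)           # toy data, assumed
    y = np.sin(x).ravel() + 0.1 * np.random.randn(50)

    model = SVR(kernel='rbf', C=10.0, epsilon=0.1)       # epsilon: width of the insensitive tube
    model.fit(x, y)
    print(len(model.support_), "support vectors out of", len(x))
    print(model.predict([[5.0]]))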


    Paul Viola 1999 Machine Learning

    New Topic: Unsupervised Learning

l What can you do to understand data when you have no labels? Find unusual structure in the data.

    Find simplifications of the data.

    l Find the clusters in the data: Fit a mixture of gaussians

    Been there, done that.

    l Reduce the dimensionality of the data: Find a linear projection from high to low dimensions

l These all amount to density estimation
    l There are many other approaches

    Build a tree which captures the data, etc.

    Paul Viola 1999 Machine Learning

    8 bits per pixel .36 bits per pixel


    Paul Viola 1999 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    Lecture 14:

    on to Unsupervised Learning

    Paul Viola 1999 Machine Learning

News
    l I will try to give you a feeling for where we are

    headed: Next 4 lectures

Bayes Nets / Graphical Models / Boltzmann Machines / HMMs

    After that, a series of topics (from papers).


    Paul Viola 1999 Machine Learning

    Review & Overview

    l Lecture 13: The end of regression

    l Begin unsupervised learning. PCA, etc

    Paul Viola 1999 Machine Learning

New Topic: Unsupervised Learning
    l What can you do to understand data when you

    have no labels? Find unusual structure in the data.

    Find simplifications of the data.

    l Find the clusters in the data: Fit a mixture of gaussians

    Been there, done that.

    l Reduce the dimensionality of the data:

Find a linear projection from high to low dimensions
    l These all amount to density estimation

    l There are many other approaches Build a tree which captures the data, etc.


    Paul Viola 1999 Machine Learning

    Exploratory Data Analysis

    l Machine learning is simply not that smart

    l It is still very important to look at the data.

l But when there are millions of examples and thousands of dimensions you cannot look at the data.

    l It is very important to summarize the data Summarize the statistics

    Reduce dimensionality

    Paul Viola 1999 Machine Learning

Example: Mixture of Gaussians
    l Start out with many training points (1000s)

    Distributed in a clumpy fashion.

    l Replace with a few summary clusters (10s) One for each clump.

    l Why? Better understand/summarize the data

Demographics Data: growing number of stay at home dads
    l Salary < 1000; Children > 0; Family Salary > 20,000, etc.

    Astronomical Data: Unusual distribution of stars near galaxy

    Extract symbols
    l one cluster per word (in speech), one cluster per letter (in writing)

    Speed learning: Learn with the clusters rather than all the data (approximate)


    Paul Viola 1999 Machine Learning

    Example of Clustering

    l Can I get some examples of clustering???

    l PDP??

    l Andrew Moore

    Paul Viola 1999 Machine Learning

Speed Learning???

    l Regression & Kernel Networks

        f(x) = Σ_j b_j K(x, x_j)

    Learning time is cubic in the number of examples.

    Need to solve the linear system to find weights.

    Performance time is linear in the number of examples.

    10,000 examples -> 100 clusters??

    l Problem: these clusters are not optimal

    Other locations may be much more effective for approximating the target function.

    Clusters congregate near data, not at the margin!


    Paul Viola 1999 Machine Learning

    Support Vector Machines

    l Claims to choose the optimal set of examples 10,000 -> 100

    Performance is optimal

l But, training time is still cubic in the number of examples.

    l Though some new algorithms are beginning to work more quickly.

        f(x) = Σ_j b_j K(x, x_j)

    Paul Viola 1999 Machine Learning

Example: Dimensionality Reduction
    l It is difficult to visualize 10, 20, 1000 dimensions

    l Worse, there is reason to believe that it can be very difficult to learn in high dimensions.

    l Curse of dimensionality: As the dimensionality of the data grows, the average
    distance between points also grows.

    As a result it is difficult to get an accurate local estimate for what is going on.


    Paul Viola 1999 Machine Learning

    Curse of Dimensionality of Nearest

    Neighbor

    l How far is it to your nearest neighbor?? Easier Question: How far do you have to look before

    expecting 1 neighbor.

    Assume points are distributed in a sphere of unit radius

The point in the center of the sphere should have the largest number of neighbors. How far do we have to go
    to find 1 neighbor on average??

        Vol( S_k(r) ) = c_k r^k
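A small worked check (a sketch, assuming n points uniform in a unit k-dimensional ball; the radius expected to contain one neighbor then satisfies r^k = 1/n, i.e. r = n^(-1/k)):

    n = 1_000_000
    for k in (1, 2, 10, 100):
        print(k, round(n ** (-1 / k), 6))
    # 1 -> 1e-06,  2 -> 0.001,  10 -> ~0.25,  100 -> ~0.87: nearly the whole radius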

    Paul Viola 1999 Machine Learning

    Reducing Dimensionality Linearly

    l Where W has fewer rows than columns

    l What is the right W?? One that preserves as much information as possible.

    Ah, but what is information??

y = W x

    min_W  E[ ( x - W^T W x )^2 ]

    W = [ e_1^T
          e_2^T
            ⋮   ]
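A minimal PCA sketch of this idea (the toy data is an assumption): W's rows are the leading eigenvectors of the sample covariance, and W^T W x is the low-dimensional reconstruction whose error is being minimized:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # toy data: 500 points in 10-D

    Xc = X - X.mean(axis=0)                        # center the data
    C = np.cov(Xc, rowvar=False)                   # 10x10 sample covariance
    vals, vecs = np.linalg.eigh(C)                 # eigh returns ascending eigenvalues
    W = vecs[:, ::-1][:, :2].T                     # rows e_1, e_2: the top-2 eigenvectors

    y = Xc @ W.T                                   # low-dimensional code  y = W x
    x_hat = y @ W                                  # reconstruction  W^T W x
    print(np.mean(np.sum((Xc - x_hat) ** 2, axis=1)))   # average squared reconstruction error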


    Paul Viola 1999 Machine Learning

    First Eigenvector preserves more info...

    Paul Viola 1999 Machine Learning

256,000 Numbers        4,000 Numbers


    Paul Viola 1999 Machine Learning

    Dimensionality Reduction

    Paul Viola 1999 Machine Learning

    Cottrell and Metcalfe


    Paul Viola 1999 Machine Learning

    Information Theory for Signal

    Separation

    l The Cocktail Party Problem

l Many Speakers -- the signals are hopelessly mixed

    [Diagram: sound sources picked up by several microphones]

    Paul Viola 1999 Machine Learning

    Unmixed


    Paul Viola 1999 Machine Learning

Let's look at data

    Figures from Christian

    Unmixed Mixed

PCA Unmixed

    Paul Viola 1999 Machine Learning

    Mathematical Assumptions

    l Assumptions:

Sound travels instantaneously. Sound mixes linearly.

    Signals are independent.

        S = [ s_1(t)          M = [ m_1(t)          A = [ a_11  a_12  a_13
              s_2(t)                m_2(t)                a_21  a_22  a_23
              s_3(t) ]              m_3(t) ]              a_31  a_32  a_33 ]

        M = A S


    Paul Viola 1999 Machine Learning

    The Unmixing Problem

    l We would like to undo the mixing...

Ŝ = Â^{-1} A S

    Paul Viola 1999 Machine Learning

    Reducing Dimensionality Non-Linearly

    l Where W has fewer rows than columns

    l What is the right W?? One that preserves as much information as possible.

    Ah, but what is information??

y = g( W x )

    max_W  MI( g(Wx), x )

    W → independent components


    Paul Viola 1999 Machine Learning

ICA Unmixed

    Paul Viola 1999 Machine Learning

    Learning Rule

ΔW ∝ ( (W^T)^{-1} + (1 - 2y) x^T ) W^T W
       = ( I + (1 - 2y) x^T W^T ) W
       = ( I + (1 - 2y) u^T ) W            where  u = W x,  y = g(u)
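A minimal sketch of this update on a toy two-source problem (the logistic nonlinearity g, the step size eta, and the toy sources are assumptions; this is the standard infomax-style natural-gradient rule, not necessarily the exact code used in the course):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5000
    S = rng.laplace(size=(2, n))                 # two independent, heavy-tailed sources
    A = np.array([[1.0, 0.6], [0.4, 1.0]])       # unknown mixing matrix
    X = A @ S                                    # microphones: M = A S

    W = np.eye(2)
    eta = 0.05
    for _ in range(2000):                        # batch natural-gradient updates
        U = W @ X
        Y = 1.0 / (1.0 + np.exp(-U))             # y = g(u), logistic
        W += eta * (np.eye(2) + (1 - 2 * Y) @ U.T / n) @ W   # dW ~ (I + (1-2y) u^T) W

    print(np.round(W @ A, 2))   # close to a scaled permutation if the sources were recovered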


    Paul Viola 1999 Machine Learning

    6.891 Machine Learning and Neural

    Networks

    Lecture 15:

    Reasoning and Learning on Discrete Data

    Bayes Nets

    Paul Viola 1999 Machine Learning

News
    l Final Problem Set will be ready tomorrow

    Mostly Bayes Nets

    l Please begin to think about your final project


    Paul Viola 1999 Machine Learning

    Review & Overview

    l Lecture 14: Principal Components Analysis

A low dimensional projection can summarize data

    Independent Components Analysis An alternative to PCA which can pick out the independent

    sources of data.

    l Bayes Nets Meeting of the minds

    Artificial Intelligence and Machine Learning

    Represents symbolic knowledge and reasoning Principled mechanism for inference and learning

    Bayes Rule

    Paul Viola 1999 Machine Learning

Artificial Intelligence
    l Build systems that reason about the world:

    Diagnosis Why wont my car start?

    Goal directed behavior How can I get from here to the White House?

Space Probe: How do I change orbit, take photos of Mars, and communicate with Earth in the next 5 minutes?

    How can I symbolically integrate this function?

    Game Playing

How can I beat Kasparov?
    l Biases:

    Symbolic data and symbolic problems (not continuous)

    No representation of uncertainty or probability.


    Paul Viola 1999 Machine Learning

    Techniques in Artificial Intelligence

    l Write down a set of rules that govern the world If I get on a plane to Wash. DC then I will end up in DC.

    If I take a taxi to Logan then I will end up at Logan.

If my car is out of fuel then it won't start.

    If my starter motor is broken then it won't start.

    The integral of sin(x) is - cos(x).

    The integral of 2