Andrew Rosenberg - Lecture 5: Linear Regression with Regularization (CSC 84020 - Machine Learning)

    Lecture 5: Linear Regression with Regularization

    CSC 84020 - Machine Learning

    Andrew Rosenberg

    February 19, 2009

    Today

    Linear Regression with Regularization

    Recap

    Linear Regression

Given a target vector t and a data matrix X.

Goal: Identify the best parameters for a regression function

y = w_0 + \sum_{i=1}^{N} w_i x_i

w = (X^T X)^{-1} X^T t
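As a concrete illustration of this recap (not part of the original slides), here is a minimal numpy sketch of the closed-form fit; the synthetic data and variable names are my own assumptions.

```python
import numpy as np

# Illustrative 1-D problem: t = 2x + 1 plus noise (assumed for the example).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
t = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(50)

# Data matrix with a leading column of ones so w[0] plays the role of w0.
X = np.column_stack([np.ones_like(x), x])

# Closed-form least squares: w = (X^T X)^{-1} X^T t.
# Solving the normal equations avoids forming an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)  # roughly [1.0, 2.0]
```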

    Closed form solution for linear regression

    This solution is based on

Maximum Likelihood estimation under an assumption of a Gaussian likelihood

Empirical Risk Minimization under an assumption of squared error

The extension to basis functions gives linear regression significant power.
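To make the basis-function point concrete, a brief sketch of my own (the sine-shaped data and the cubic basis are illustrative assumptions): expand a scalar input into polynomial features, then reuse the same closed-form solution in that feature space.

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Map a 1-D input to the basis features [1, x, x^2, ..., x^degree]."""
    return np.column_stack([x ** d for d in range(degree + 1)])

# Illustrative data: a noisy sine curve, which a straight line cannot fit well.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

Phi = polynomial_design_matrix(x, degree=3)
# Same normal-equations solution as above, now over basis-function features.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(w.shape)  # (4,): one weight per basis function
```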

    Revisiting overfitting

Overfitting occurs when a model captures idiosyncrasies of the input data, rather than generalizing.

Too many parameters relative to the amount of training data

For example, an order-N polynomial can exactly fit N + 1 data points.
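A quick numerical check of that claim (my own sketch with arbitrary targets): an order-N polynomial fit to N + 1 distinct points drives the training error to numerical zero.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
x = np.linspace(0.0, 1.0, N + 1)   # N + 1 distinct data points
t = rng.standard_normal(N + 1)     # arbitrary targets

coeffs = np.polyfit(x, t, deg=N)   # order-N polynomial fit
residual = t - np.polyval(coeffs, x)
print(np.max(np.abs(residual)))    # ~1e-15: the curve passes through every point
```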

    Overfitting Example

    Avoiding Overfitting

    Ways of detecting/avoiding overfitting.

    Use more data

    Evaluate on a parameter tuning set

    Regularization

    Take a Bayesian approach
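As a sketch of the "evaluate on a parameter tuning set" idea above (my own illustration; the split sizes and polynomial model family are assumptions), a complexity parameter such as the polynomial order M can be chosen by fitting on a training split and scoring on a held-out tuning split.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 40)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)

# Hold out a tuning set that the fit never sees.
x_train, t_train = x[:30], t[:30]
x_tune, t_tune = x[30:], t[30:]

def tuning_error(M):
    """Mean squared error on the tuning set for an order-M polynomial fit."""
    coeffs = np.polyfit(x_train, t_train, deg=M)
    pred = np.polyval(coeffs, x_tune)
    return np.mean((t_tune - pred) ** 2)

errors = {M: tuning_error(M) for M in range(10)}
best_M = min(errors, key=errors.get)
print(best_M)  # very large M tends to look worse on the held-out tuning set
```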

    Regularization

In a Linear Regression model, overfitting is characterized by large parameters.

        M = 0    M = 1    M = 3       M = 9
w0      0.19     0.82     0.31        0.35
w1              -1.27     7.99        232.37
w2                       -25.43      -5321.83
w3                        17.37       48568.31
w4                                   -231639.30
w5                                    640042.26
w6                                   -1061800.52
w7                                    1042400.18
w8                                   -557682.99
w9                                    125201.43

    Regularization

    Introduce a penalty term for the size of the weights.

    Unregularized Regression

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} \{ t_n - y(x_n, w) \}^2

Regularized Regression (L2-Regularization, or Ridge Regularization)

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \|w\|^2

Note: A larger λ leads to a heavier complexity penalty.
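A direct transcription of that regularized objective into code (a sketch; the linear model y(x, w) = Φ(x)w, the toy numbers, and λ = 0.1 are assumptions of mine):

```python
import numpy as np

def regularized_error(w, Phi, t, lam):
    """E(w) = 1/2 * sum_n (t_n - y(x_n, w))^2 + lam/2 * ||w||^2, with y = Phi @ w."""
    residual = t - Phi @ w
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * np.dot(w, w)

# Tiny illustrative check with a bias-plus-slope model.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.1, 1.1, 1.9])
w = np.array([0.0, 1.0])
print(regularized_error(w, Phi, t, lam=0.1))  # data term plus 0.05 * ||w||^2
```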

Least Squares Regression with L2-Regularization

\nabla_w E(w) = 0

\nabla_w \left[ \frac{1}{2} \sum_{i=0}^{N-1} (y(x_i, w) - t_i)^2 + \frac{\lambda}{2} \|w\|^2 \right] = 0

\nabla_w \left[ \frac{1}{2} \|t - Xw\|^2 + \frac{\lambda}{2} \|w\|^2 \right] = 0

\nabla_w \left[ \frac{1}{2} (t - Xw)^T (t - Xw) + \frac{\lambda}{2} w^T w \right] = 0

-X^T t + X^T X w + \nabla_w \left( \frac{\lambda}{2} w^T w \right) = 0

-X^T t + X^T X w + \lambda w = 0

-X^T t + X^T X w + \lambda I w = 0

-X^T t + (X^T X + \lambda I) w = 0

(X^T X + \lambda I) w = X^T t

w = (X^T X + \lambda I)^{-1} X^T t
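A minimal numpy sketch of that regularized closed form (my own illustration; the data, the degree-9 features, and the λ values are assumptions). Setting λ = 0 recovers the unregularized solution, although the system may then be badly conditioned.

```python
import numpy as np

def ridge_fit(X, t, lam):
    """Closed-form L2-regularized least squares: w = (X^T X + lam * I)^(-1) X^T t."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

# Illustrative degree-9 polynomial features, the setting of the weight table above.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
X = np.column_stack([x ** d for d in range(10)])

w_weak = ridge_fit(X, t, lam=1e-8)    # nearly unregularized: large, unstable weights
w_strong = ridge_fit(X, t, lam=1e-2)  # stronger penalty: weights shrink toward zero
print(np.max(np.abs(w_weak)), np.max(np.abs(w_strong)))
```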

    Regularization Results


Regularization Approaches

L2-Regularization (closed form in polynomial time):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \frac{\lambda}{2} \|w\|^2

L1-Regularization (can be approximated in poly-time):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \|w\|_1

L0-Regularization (NP-complete optimization):

E(w) = \frac{1}{2} \sum_{n=0}^{N-1} (t_n - y(x_n, w))^2 + \lambda \sum_{n=0}^{N-1} \mathbb{1}(w_n \neq 0)

The L0 norm represents the optimal subset of features needed by a regression model.

How can we optimize each of these functions?
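To make the three penalty terms concrete, a small sketch (the weight vector and λ are arbitrary choices of mine) that simply evaluates each regularizer:

```python
import numpy as np

w = np.array([0.0, 3.0, -0.5, 0.0, 2.0])
lam = 0.1

l2 = 0.5 * lam * np.sum(w ** 2)   # (lambda/2) * ||w||_2^2 : smooth, closed-form solution exists
l1 = lam * np.sum(np.abs(w))      # lambda * ||w||_1       : convex, encourages sparsity
l0 = lam * np.count_nonzero(w)    # lambda * #nonzero weights : combinatorial
print(l2, l1, l0)
```

In practice, the convex L1 penalty is the usual tractable surrogate for L0, since it tends to drive some weights exactly to zero while remaining efficiently optimizable.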


    Curse of Dimensionality

Increasing the dimensionality of the feature space exponentially increases the data needs.

Note: The dimensionality of the feature space = the number of features.

What is the message of this?

Models should be small relative to the amount of available data.

Dimensionality reduction techniques and feature selection can help.

L0-regularization is feature selection for linear models.

L1- and L2-regularization approximate feature selection and regularize the function.

    Curse of Dimensionality Example

Assume a cell requires 100 data points to generalize properly, and the features are 3-ary multinomials.

One dimension requires 300 data points.

Two dimensions require 900 data points.

Three dimensions require 2,700 data points.

In this example, for D-dimensional model fitting, the data requirements are 3^D × 100.

    Argument against the Kitchen Sink approach.
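The arithmetic behind those counts, as a one-line Python sketch (using the 100-points-per-cell and three-valued-feature assumptions stated above):

```python
for D in range(1, 6):
    print(D, 3 ** D * 100)  # 300, 900, 2700, 8100, 24300 data points needed
```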

    Bayesians v. Frequentists

    What is a Probability?

    The Frequentist position

    A probability is the likelihood that an event will happen.

It is approximated as the ratio of the number of times the event happened to the total number of events.

    Assessment is very important to select a model.

Point estimates are fine: n/N.

    The Bayesian position

    A probability is the degree of believability that the event will happen.

    Bayesians require that probabilities be conditioned on data, p(y|x).

The Bayesian approach is optimal given a good model, a good prior, and a good loss function; don't worry about assessment as much.

Bayesians say: if you are ever making a point estimate, you've made a mistake. The only valid probabilities are posteriors based on evidence given some prior.

    Bayesian Linear Regression

In the previous derivation of the linear regression optimization, we made point estimates for the weight vector, w.

Bayesians would say: stop right there. Use a distribution over w to estimate the parameters.

p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left(-\frac{\alpha}{2} w^T w\right)

α is a hyperparameter over w, where α is the precision, or inverse variance, of the distribution.
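As a small sketch of what a distribution over w looks like in practice (M and α are arbitrary illustrative values of mine): drawing a few weight vectors from the prior N(w | 0, α⁻¹I).

```python
import numpy as np

M = 3        # polynomial order, so w has M + 1 entries (illustrative)
alpha = 2.0  # prior precision; larger alpha concentrates the prior around zero

rng = np.random.default_rng(5)
# Each row is one weight vector drawn from the isotropic Gaussian prior.
prior_samples = rng.normal(loc=0.0, scale=np.sqrt(1.0 / alpha), size=(5, M + 1))
print(prior_samples)
```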

    So, optimize

p(w | x, t, \alpha, \beta) \propto p(t | x, w, \beta)\, p(w | \alpha)

Again, optimizing the log likelihood yields a simpler solution:

\ln p(t | x, w, \beta) + \ln p(w | \alpha)

p(t | x, w, \beta) = \prod_{n=0}^{N-1} \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}(t_n - y(x_n, w))^2\right)

\ln p(t | x, w, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \frac{\beta}{2}\sum_{n=0}^{N-1}(t_n - y(x_n, w))^2

\ln p(w | \alpha) = \frac{M+1}{2}\ln\alpha - \frac{M+1}{2}\ln 2\pi - \frac{\alpha}{2} w^T w

\ln p(t | x, w, \beta) + \ln p(w | \alpha) = -\frac{\beta}{2}\sum_{n=0}^{N-1}(t_n - y(x_n, w))^2 - \frac{\alpha}{2} w^T w + \text{const}
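Maximizing this log posterior in w is the same as minimizing the L2-regularized squared error with λ = α/β. A small numpy sketch of that correspondence for a linear-in-features model (the data, features, α, and β are illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
X = np.column_stack([x ** d for d in range(4)])  # cubic polynomial features

alpha, beta = 0.5, 25.0   # prior precision and noise precision
lam = alpha / beta        # the equivalent ridge penalty

# MAP estimate: (beta * X^T X + alpha * I) w = beta * X^T t.
w_map = np.linalg.solve(beta * X.T @ X + alpha * np.eye(X.shape[1]), beta * X.T @ t)
# Ridge estimate with lambda = alpha / beta.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
print(np.allclose(w_map, w_ridge))  # True: the two solutions coincide
```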

    Broader Context

    Overfitting is bad.

    Bayesians v. Frequentists.

    Does it matter which camp you lie in?

Not particularly, but Bayesian approaches give us some useful, interesting, and principled tools.

    Bye

Next: Categorization

Logistic Regression
Naive Bayes
