
Page 1:

Regression

www.biostat.wisc.edu/~dpage/cs760/

Page 2:

Goals for the lecture

you should understand the following concepts
•  linear regression
•  RMSE, MAE, and R-square
•  logistic regression
•  convex functions and sets
•  ridge regression (L2 penalty)
•  lasso (L1 penalty): least absolute shrinkage and selection operator
•  lasso by proximal method (ISTA)
•  lasso by coordinate descent

Page 3:

Linear Regression

•  Linear regression assumes that the relation between the expected value of dependent variable Y and the value of independent variable(s) X is linear
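In symbols (a standard way to write this assumption; the notation is added here for clarity, not taken from the slide):

E[Y \mid x_1, \dots, x_p] = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p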

Page 4:

Ordinary Least Squares (OLS)


Page 5:

Ordinary Least Squares (OLS)


RMSE, MAE: Objective function

Page 6:

Using Linear Algebra

•  As we go to more variables, the notation becomes more complex

•  Use matrix representation and operations, assume all features standardized (standard normal), and assume an additional constant 1 feature as we did with ANNs

•  Given data matrix X with label vector Y

•  Find vector of coefficients β to minimize:

•  ||Xβ – Y||₂²
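A minimal NumPy sketch of this fit (my own illustration, not from the slides; it assumes X already includes the constant-1 column, and the toy data and names are mine). np.linalg.lstsq solves the same least-squares problem as the normal-equation form β = (XᵀX)⁻¹XᵀY, but more stably:

import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: find beta minimizing ||X beta - y||_2^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# toy usage
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.standard_normal((100, 3))])
y = X @ np.array([1.0, 2.0, -0.5, 0.0]) + 0.1 * rng.standard_normal(100)
beta_hat = ols_fit(X, y)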

Page 7:

Multivariate Linear Regression


Page 8:

Evaluation Metrics for Numeric Prediction

•  Root mean squared error (RMSE)

•  Mean absolute error (MAE) – average error

•  R-square (R-squared)

•  Historically all were computed on training data, and possibly adjusted after, but really should cross-validate

Page 9:

R-square(d)

•  Formulation 1:

   R^2 = 1 - \frac{\sum_i (y_i - h(\vec{x}_i))^2}{\sum_i (y_i - \bar{y})^2}

•  Formulation 2: square of Pearson correlation coefficient r. Recall for x, y:

   r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
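A small NumPy sketch of these metrics (function and variable names are mine, not from the slides). On training-set OLS predictions the two R² formulations coincide, which is the point made on the next slide:

import numpy as np

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def r2_formulation1(y, yhat):
    # 1 - (residual sum of squares) / (total sum of squares)
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def r2_formulation2(y, yhat):
    # square of the Pearson correlation between labels and predictions
    r = np.corrcoef(y, yhat)[0, 1]
    return r ** 2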

Page 10:

Some Observations

•  R-square of 0 means you have no model; R-square of 1 implies a perfect model (loosely, explains all variation)

•  These two formulations agree when performed on the training set

•  They do not agree when we do cross-validation, in general, because the mean of the training set differs from the mean of each fold

•  Should do CV and use the first formulation, but it can be negative!

Page 11:

Logistic Regression: Motivation

•  Linear regression was used to fit a linear model to the feature space.

•  A requirement arose to do classification: make the number of interest the probability that a feature will take a particular value given other features, P(Y = 1 | X)

•  So, extend linear regression for classification

Page 12:

Theory of Logistic Regression

[Figure: logistic curve, y plotted against x]

Page 13:

Likelihood function for logistic regression  

•  Naïve Bayes classification makes use of conditional information and computes the full joint distribution

•  Logistic regression instead computes the conditional probability

•  It operates on the conditional likelihood function or conditional log likelihood

•  Conditional likelihood is the probability of the observed Y values conditioned on their corresponding X values

•  Unlike linear regression, there is no closed-form solution, so we use standard gradient ascent to find the weight vector w that maximizes the conditional likelihood
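A minimal sketch of that gradient-ascent loop in NumPy (my own illustration; it assumes labels y in {0, 1}, a constant-1 feature column in X, and a fixed learning rate, and the names are not from the slides):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_fit(X, y, lr=0.1, n_iters=1000):
    """Maximize the conditional log likelihood by gradient ascent.

    The gradient of the log likelihood w.r.t. w is X^T (y - p),
    where p_i = P(Y = 1 | x_i) = sigmoid(w . x_i).
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)
        w += lr * X.T @ (y - p) / len(y)   # ascend the (average) log likelihood
    return w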

Page 14:

Background on Optimization (Bubeck, 2015) Thanks Irene Giacomelli  

1  Introduction

The central objects of our study are convex functions and convex sets in R^n.

Definition 1.1 (Convex sets and convex functions). A set X ⊆ R^n is said to be convex if it contains all of its segments, that is

\forall (x, y, \gamma) \in X \times X \times [0, 1], \quad (1 - \gamma)x + \gamma y \in X.

A function f : X → R is said to be convex if it always lies below its chords, that is

\forall (x, y, \gamma) \in X \times X \times [0, 1], \quad f((1 - \gamma)x + \gamma y) \le (1 - \gamma)f(x) + \gamma f(y).

We are interested in algorithms that take as input a convex set X and a convex function f and output an approximate minimum of f over X. We write compactly the problem of finding the minimum of f over X as

\min f(x) \quad \text{s.t. } x \in X.

In the following we will make more precise how the set of constraints X and the objective function f are specified to the algorithm.

Page 15:

(Non-)Convex Function or Set  

Page 16:

Epigraph (Bubeck, 2015)  

1.2. Basic properties of convexity

We introduce now the key notion of subgradients.

Definition 1.2 (Subgradients). Let X ⊆ R^n, and f : X → R. Then g ∈ R^n is a subgradient of f at x ∈ X if for any y ∈ X one has

f(x) - f(y) \le g^\top (x - y).

The set of subgradients of f at x is denoted ∂f(x).

To put it differently, for any x ∈ X and g ∈ ∂f(x), f is above the linear function y ↦ f(x) + g^⊤(y − x). The next result shows (essentially) that a convex function always admits subgradients.

Proposition 1.1 (Existence of subgradients). Let X ⊆ R^n be convex, and f : X → R. If ∀x ∈ X, ∂f(x) ≠ ∅, then f is convex. Conversely, if f is convex then for any x ∈ int(X), ∂f(x) ≠ ∅. Furthermore, if f is convex and differentiable at x then ∇f(x) ∈ ∂f(x).

Before going to the proof we recall the definition of the epigraph of a function f : X → R:

epi(f) = \{(x, t) \in X \times \mathbb{R} : t \ge f(x)\}.

It is obvious that a function is convex if and only if its epigraph is a convex set.

Proof. The first claim is almost trivial: let g ∈ ∂f((1 − γ)x + γy); then by definition one has

f((1 - \gamma)x + \gamma y) \le f(x) + \gamma g^\top (y - x),
f((1 - \gamma)x + \gamma y) \le f(y) + (1 - \gamma) g^\top (x - y),

which clearly shows that f is convex by adding the two (appropriately rescaled) inequalities.

Now let us prove that a convex function f has subgradients in the interior of X. We build a subgradient by using a supporting hyperplane to the epigraph of the function. Let x ∈ X. Then clearly (x, f(x)) ∈ ∂epi(f), and epi(f) is a convex set. Thus by using the Supporting Hyperplane Theorem, there exists (a, b) ∈ R^n × R such that

a^\top x + b f(x) \ge a^\top y + b t, \quad \forall (y, t) \in \mathrm{epi}(f). \tag{1.2}
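As a concrete illustration of Definition 1.2 (my own sketch, not part of the Bubeck excerpt): for f(x) = |x| the subdifferential at 0 is the whole interval [-1, 1], and the check below verifies the subgradient inequality numerically for a few candidate values of g.

import numpy as np

def is_subgradient(f, g, x, ys):
    # Definition 1.2: g is a subgradient of f at x if
    # f(x) - f(y) <= g * (x - y) for all y.
    return all(f(x) - f(y) <= g * (x - y) + 1e-12 for y in ys)

f = abs
ys = np.linspace(-5, 5, 1001)
print(is_subgradient(f, 0.5, 0.0, ys))   # True: 0.5 lies in the subdifferential [-1, 1] at 0
print(is_subgradient(f, 1.5, 0.0, ys))   # False: 1.5 lies outside [-1, 1]
print(is_subgradient(f, 1.0, 2.0, ys))   # True: |x| is differentiable at 2 with gradient 1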

Page 17:

•  Show for all real a < b and 0 ≤ c ≤ 1, f(ca + (1-c)b) ≤ c f(a) + (1-c) f(b) for the following (checked numerically in the sketch after this list):

•  f(x) = |x|
•  f(x) = x²
•  Not so for f(x) = x³

•  In general x could be a vector x

•  For gradient descent, we also want f(x) to be continuously differentiable

•  For |x| we need proximal methods, subgradient methods, or coordinate descent
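A quick numeric illustration of that chord condition (my own sketch; random sampling is only suggestive, not a proof):

import numpy as np

def looks_convex(f, n_trials=10000, seed=0):
    """Sample (a, b, c) and test f(c*a + (1-c)*b) <= c*f(a) + (1-c)*f(b)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        a, b = rng.uniform(-10, 10, size=2)
        c = rng.uniform(0, 1)
        if f(c * a + (1 - c) * b) > c * f(a) + (1 - c) * f(b) + 1e-9:
            return False
    return True

print(looks_convex(abs))             # True
print(looks_convex(lambda x: x**2))  # True
print(looks_convex(lambda x: x**3))  # False: the chord condition fails for x^3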

 

       

Page 18:

Logistic Regression Algorithm

Error in estimate

Page 19:

Regularized Regression

•  Regression (linear or logistic) is prone to overfitting, especially when:
   •  there are a large number of features, or
   •  it is fit with high-order polynomial features.

•  Regularization helps combat overfitting by having a simpler model. It is used when we want to have:
   •  less variation in the different weights, or
   •  smaller weights overall, or
   •  only a few non-zero weights (and thus features considered).

•  Regularization is accomplished by adding a penalty term to the target function that is being optimized.

•  Two types – L2 and L1 regularization.

Page 20:


L2 regularization  

Page 21:

L2 regularization in linear regression

•  Called "ridge regression"

•  Still has a closed-form solution, so even though the objective is continuously differentiable and convex, we don't need gradient descent

•  β = (XᵀX + λI)⁻¹XᵀY
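A small NumPy sketch of that closed form (my own illustration; it assumes standardized features and, for simplicity, penalizes the constant-1 column's coefficient along with the rest):

import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: beta = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

Using np.linalg.solve avoids forming the inverse explicitly; for lam > 0 the matrix XᵀX + λI is always invertible, which is one practical benefit of the penalty.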

Page 22:


L1 regularization  

Page 23:

LASSO: Penalty as a Constraint

Page 24:

LASSO: Penalty as a Term in the Objective

Obtained by taking the Lagrangian. Even for linear regression, there is no closed-form solution. Ordinary gradient descent also does not work because the absolute-value penalty has no derivative at zero. The fastest methods now are FISTA and (faster still) coordinate descent.

Page 25:

Proximal Methods

•  f(x) = g(x) + h(x)
   –  g is convex, differentiable
   –  h is convex and decomposable, but not differentiable
   –  Example: g is squared error, h is the lasso penalty – a sum of absolute-value terms, one per coefficient (so one per feature)
   –  Find β to minimize ||Xβ – Y||₂² + λ||β||₁, where the first term plays the role of g and the second of h

Page 26:

Proximal Operator: Soft-Thresholding

ISTA. Consider the lasso criterion

f(x) = \underbrace{\tfrac{1}{2}\|y - Ax\|^2}_{g(x)} + \underbrace{\lambda \|x\|_1}_{h(x)}

The prox function is now

\mathrm{prox}_t(x) = \arg\min_{z \in \mathbb{R}^n} \frac{1}{2t}\|x - z\|^2 + \lambda \|z\|_1 = S_{\lambda t}(x)

where S_\lambda(x) is the soft-thresholding operator,

[S_\lambda(x)]_i =
\begin{cases}
x_i - \lambda & \text{if } x_i > \lambda \\
0 & \text{if } -\lambda \le x_i \le \lambda \\
x_i + \lambda & \text{if } x_i < -\lambda
\end{cases}
\quad \text{for all } i

We typically apply this to the coefficient vector β.
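A direct NumPy translation of the soft-thresholding operator (a sketch; the function name is illustrative):

import numpy as np

def soft_threshold(x, lam):
    """Elementwise soft-thresholding: shrink toward 0 by lam, zeroing the middle band."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([3.0, 0.4, -2.0]), 1.0))  # [ 2.  0. -1.]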

Page 27:

Iterative Shrinkage-Thresholding Algorithm (ISTA)

•  Initialize β; let η be the learning rate
•  Repeat until convergence:
   –  Make a proximal gradient step: β ← S_{λη}(β – ηXᵀ(Xβ – y))
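A minimal ISTA loop in NumPy (my own sketch; it assumes a fixed step size η = 1/L, with L the largest eigenvalue of XᵀX, which keeps the iteration stable):

import numpy as np

def ista(X, y, lam, n_iters=500):
    """ISTA for the lasso objective 0.5*||X beta - y||^2 + lam*||beta||_1."""
    eta = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # step size 1/L
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        z = beta - eta * X.T @ (X @ beta - y)       # gradient step on the smooth part g
        beta = np.sign(z) * np.maximum(np.abs(z) - lam * eta, 0.0)  # soft-threshold by lam*eta
    return beta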

Page 28:

Coordinate Descent

•  Fastest current method for lasso-penalized linear or logistic regression

•  Simple idea: adjust one feature at a time, and special-case it near 0, where the gradient is not defined (where the absolute value's effect changes)

•  Can take features in a cycle in any order, or randomly pick the next feature (analogous to Gibbs sampling)

•  To "special-case it near 0", just apply soft-thresholding everywhere

Page 29:

Coordinate Descent Algorithm

•  Initialize coefficients
•  Cycle over features until convergence:
   –  For each example i and feature j, compute the "partial residual":

      r_{ij} = y_i - \sum_{k \ne j} x_{ik}\beta_k

   –  Compute the simple least-squares coefficient of these residuals on the jth predictor (as we did in OLS regression):

      \beta_j^* = \frac{1}{N}\sum_{i=1}^{N} x_{ij} r_{ij}

   –  Update β_j by soft-thresholding, where for any term T, "(T)_+" denotes max(0, T):

      \beta_j \leftarrow S(\beta_j^*, \lambda) = \mathrm{sign}(\beta_j^*)\,(|\beta_j^*| - \lambda)_+

(From "Coordinate descent for the lasso," useR! 2009, Trevor Hastie, Stanford Statistics. The objective there is

\min_\beta \; \frac{1}{2N}\sum_{i=1}^{N}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

Suppose the p predictors and response are standardized to have mean zero and variance 1. Initialize all the β_j = 0, then cycle over j = 1, 2, ..., p, 1, 2, ... till convergence, applying the three steps above.)
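A compact NumPy sketch of this cycle (my own illustration; it assumes standardized predictors with unit variance, so the simple least-squares coefficient is just the mean of x_ij * r_ij):

import numpy as np

def lasso_coordinate_descent(X, y, lam, n_cycles=100):
    """Cyclic coordinate descent for (1/2N)*||y - X beta||^2 + lam*||beta||_1.

    Assumes the columns of X are standardized (mean 0, variance 1) and y is centered.
    """
    N, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_cycles):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]      # partial residuals (add feature j back)
            beta_star = X[:, j] @ r_j / N               # simple least-squares coefficient on x_j
            beta[j] = np.sign(beta_star) * max(abs(beta_star) - lam, 0.0)  # soft-threshold
    return beta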

Page 30:

Comments on penalized regression

•  L2-penalized regression also called “ridge regression”

•  Can combine L1 and L2 penalties: “elastic net”

•  L1-penalized regression is an especially active area of research
   –  group lasso
   –  fused lasso
   –  others

Page 31:

Comments on logistic regression  

•  Logistic regression is a linear classifier

•  In a Bayesian classifier, features are independent given the class → an assumption on P(X|Y)

•  In logistic regression, the functional form is P(Y|X); there is no assumption on P(X|Y)

•  Logistic regression is optimized using the conditional likelihood

•  There is no closed-form solution
•  The objective is concave → global optimum with gradient ascent

Page 32:

More comments on regularization

•  Linear and logistic regression are prone to overfitting

•  Regularization helps combat overfitting by adding a penalty term to the target function being optimized

•  L1 regularization is often preferred since it produces sparse models. It can drive certain coefficients (weights) to zero, performing feature selection in effect

•  L2 regularization drives toward smaller and simpler weight vectors but cannot perform feature selection like L1 regularization

•  Few uses of OLS these days… e.g., Warfarin Dosing (NEJM 2009)… just 30 carefully hand-selected features