
  • 1/50

    Chapter 2 - linear methods
    Linear regression

    Logistic regression

    Geir Storvik

    January 25, 2021


  • 2/50

    Lectures

    Course web-page: 3 hours lectures

    Schedule: 4 hours lectures

    New plan: Only lectures 14.15-15.00 on Wednesdays

    We might use the extra hour later!


  • 3/50

    Linear regression

    What is linear regression? Some repetition from STK1110; see chapter 12 in Devore & Berk.

    Properties: what can be done with the linear model? Challenges/weaknesses.

    Many of these are common with other methods.


  • 4/50

    Prediction - Advertising data

    Response: Sales of a product in 200 different markets (sales)
    Explanatory variables:
      Advertisement budget for TV (TV)
      Advertisement budget for radio (radio)
      Advertisement budget for newspapers (newspaper)

    [Figure: scatter plots of sales against the TV, radio and newspaper budgets]

    1 Some of the figures are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

    Questions:
    Is there a relationship between advertising and sales?
    How strong is this relationship?
    Which medium has the strongest influence?
    How precisely can we estimate the effects?
    How precisely can we predict future sales?
    Is there a linear relationship?
    Is there some synergy/interaction between the different media?


  • 5/50

    Linear regression

    Data (x_1, y_1), ..., (x_n, y_n)
    Model: Assume
      Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i,   \varepsilon_i \overset{ind}{\sim} (0, \sigma^2)   (*)
    Matrix form:
      \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
      = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ & & \vdots & & \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix}
        \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
      + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
      Y = X\beta + \varepsilon
    Least squares estimate (also ML if \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)):
      \hat\beta = (X^T X)^{-1} X^T Y
    Prediction at a new point x^* = (x_1^*, ..., x_p^*):
      \hat y^* = \hat\beta_0 + \hat\beta_1 x_1^* + \cdots + \hat\beta_p x_p^*

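    A minimal R sketch (simulated data; the coefficients 1, 2, -0.5 are arbitrary choices, not from the slides) showing that the closed-form estimate agrees with lm():

      set.seed(1)
      n <- 100
      x1 <- rnorm(n); x2 <- rnorm(n)
      y <- 1 + 2*x1 - 0.5*x2 + rnorm(n)          # true beta = (1, 2, -0.5)
      X <- cbind(1, x1, x2)                      # design matrix with intercept
      solve(t(X) %*% X, t(X) %*% y)              # (X^T X)^{-1} X^T y
      coef(lm(y ~ x1 + x2))                      # same estimates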

  • 6/50

    Random vectors

    If z_1, ..., z_p are random variables, we say that z = (z_1, ..., z_p) is a random vector.
    We define the expectation and covariance matrix by
      E[z] = \begin{pmatrix} E[z_1] \\ E[z_2] \\ \vdots \\ E[z_p] \end{pmatrix},
      V[z] = \begin{pmatrix} Var[z_1] & Cov[z_1, z_2] & \cdots & Cov[z_1, z_p] \\ Cov[z_2, z_1] & Var[z_2] & \cdots & Cov[z_2, z_p] \\ \vdots & \vdots & \ddots & \vdots \\ Cov[z_p, z_1] & Cov[z_p, z_2] & \cdots & Var[z_p] \end{pmatrix}
    Rules:
      E[Az + b] = A E[z] + b
      V[Az + b] = A V[z] A^T


  • 7/50

    Properties - linear regression

    Estimate: \hat\beta = (X^T X)^{-1} X^T Y
    If (*) is true,
      E[\hat\beta] = \beta,   V[\hat\beta] = \sigma^2 (X^T X)^{-1}
    If also \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2):
      Test of H_0: \beta_j = 0:   T = \frac{\hat\beta_j}{SE(\hat\beta_j)} \overset{H_0}{\sim} t_{n-p-1}
      Test of H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0:   F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)} \overset{H_0}{\sim} F_{p,n-p-1}
      where RSS = \sum_{i=1}^n (y_i - \hat y_i)^2 < TSS = \sum_{i=1}^n (y_i - \bar y)^2
    These results hold approximately also without normality if n \gg p.


  • 8/50

    Geometric interpretation

      \hat\beta = (X^T X)^{-1} X^T Y,   \hat y_i = x_i^T \hat\beta,   \hat Y = (\hat y_1, ..., \hat y_n)
      \hat Y = X\hat\beta = \underbrace{X (X^T X)^{-1} X^T}_{P}\, Y = PY,   P symmetric
      P^2 = X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T = P   (projection matrix)
      Y - \hat Y = (I - P) Y
      (Y - \hat Y)^T \hat Y = Y^T (I - P) P Y = 0   (orthogonality)

    [Figure: Y projected onto the column space C(X); the residual Y − Ŷ is orthogonal to C(X)]

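    A small numerical check of the projection properties above (simulated X and Y; any full-rank design works):

      set.seed(1)
      X <- cbind(1, rnorm(20), rnorm(20))
      Y <- rnorm(20)
      P <- X %*% solve(t(X) %*% X) %*% t(X)
      max(abs(P %*% P - P))                      # ~ 0, so P^2 = P
      Yhat <- P %*% Y
      sum((Y - Yhat) * Yhat)                     # ~ 0, residuals orthogonal to the fit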

  • 9/50

    Advertising data

    > fit.lm <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
    > summary(fit.lm)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.938889   0.311908   9.422
    ...

  • 10/50

    Advertising data

    Clearly significant that at least one of the explanatory variables is useful for predicting the response.
    Are all explanatory variables important?
    Newspaper seems to be less important.
    > fit2.lm <- lm(Sales ~ TV + Radio, data = Advertising)
    > summary(fit2.lm)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.92110    0.29449   9.919
    ...

  • 11/50

    Comparison of models

    Topic within chapter 3. Here: some simple approaches.
    Assume we want to test
      H_0: \beta_{i_1} = \beta_{i_2} = \cdots = \beta_{i_q} = 0
    Let RSS_0 = \sum_{i=1}^n (y_i - \hat y_i)^2 where \hat y_i is computed under H_0, and RSS similarly for the full model. Then
      F = \frac{(RSS_0 - RSS)/q}{RSS/(n-p-1)} \overset{H_0}{\sim} F_{q,n-p-1}
    Example: H_0: \beta_3 = 0, q = 1
    > RSS <- sum(residuals(fit.lm)^2)
    > RSS0 <- sum(residuals(fit2.lm)^2)
    > Fobs <- ((RSS0 - RSS)/1)/(RSS/196)
    > Fobs
    [1] 0.03122805
    > 1 - pf(Fobs, 1, 196)
    [1] 0.8599151
    > anova(fit.lm, fit2.lm)
    Analysis of Variance Table
    Model 1: Sales ~ TV + Radio + Newspaper
    Model 2: Sales ~ TV + Radio
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1    196 556.83
    2    197 556.91 -1 -0.088717 0.0312 0.8599


  • 12/50

    Two tests when q = 1

    Test of H_0: \beta_j = 0:
      T = \frac{\hat\beta_j}{SE(\hat\beta_j)} \overset{H_0}{\sim} t_{n-p-1}
      F = \frac{(RSS_0 - RSS)/1}{RSS/(n-p-1)} \overset{H_0}{\sim} F_{1,n-p-1}
    Same test, since F = T^2 and
      T \sim t_{n-p-1}  \Rightarrow  T^2 \sim F_{1,n-p-1}
    Example: F = 0.03122805
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.938889   0.311908   9.422
    ...

  • 13/50

    Interactions

    Alternative model for the Advertising data:
      Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
    > fit3.lm <- lm(Sales ~ TV * Radio, data = Advertising)
    > summary(fit3.lm)
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 6.750e+00  2.479e-01  27.233
    ...

  • 14/50

    What is linearity?

    Model with interactions:
      Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
    The model is not linear in the x's.
    The model is linear in the \beta's.
    The theory of linear regression requires linearity in the \beta's.


  • 15/50

    What if p is large?

    Example: Hitters
    Response: Salary, p = 19 explanatory variables
    > fit.lm <- lm(Salary ~ ., data = Hitters)
    > summary(fit.lm)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  163.10359   90.77854   1.797 0.073622 .
    AtBat         -1.97987    0.63398  -3.123 0.002008 **
    Hits           7.50077    2.37753   3.155 0.001808 **
    HmRun          4.33088    6.20145   0.698 0.485616
    Runs          -2.37621    2.98076  -0.797 0.426122
    RBI           -1.04496    2.60088  -0.402 0.688204
    Walks          6.23129    1.82850   3.408 0.000766 ***
    Years         -3.48905   12.41219  -0.281 0.778874
    CAtBat        -0.17134    0.13524  -1.267 0.206380
    CHits          0.13399    0.67455   0.199 0.842713
    CHmRun        -0.17286    1.61724  -0.107 0.914967
    CRuns          1.45430    0.75046   1.938 0.053795 .
    CRBI           0.80771    0.69262   1.166 0.244691
    CWalks        -0.81157    0.32808  -2.474 0.014057 *
    LeagueN       62.59942   79.26140   0.790 0.430424
    DivisionW   -116.84925   40.36695  -2.895 0.004141 **
    PutOuts        0.28189    0.07744   3.640 0.000333 ***
    Assists        0.37107    0.22120   1.678 0.094723 .
    Errors        -3.36076    4.39163  -0.765 0.444857
    NewLeagueN   -24.76233   79.00263  -0.313 0.754218
    ---
    Residual standard error: 315.6 on 243 degrees of freedom
      (59 observations deleted due to missingness)
    Multiple R-squared: 0.5461,  Adjusted R-squared: 0.5106
    F-statistic: 15.39 on 19 and 243 DF,  p-value: < 2.2e-16

    How to choose explanatory variables?


  • 16/50

    Variable selection

    The number of possibilities grows fast with p:
      p = 3 gives 2^3 = 8 possible models
      p = 30 gives 2^30 = 1 073 741 824 possible models!
    Forward selection
      Start with the null model Y = \beta_0 + \varepsilon
      Add the variable that gives the best improvement
      Continue as long as you obtain a significant improvement
    Backward selection
      Start with the full model Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon
      Remove the variable that gives the smallest deterioration
      Continue until you get a non-significant deterioration
    Mixed selection
      Combination of forward and backward selection
    We will come back to this in chapter 3; a sketch with R's step() follows below.

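    A sketch of forward and backward selection in R with step(); note that step() uses AIC rather than the significance-based rule above. The ISLR package as the source of Hitters is an assumption:

      library(ISLR)                              # assumed source of the Hitters data
      Hitters <- na.omit(Hitters)
      null.lm <- lm(Salary ~ 1, data = Hitters)  # null model
      full.lm <- lm(Salary ~ ., data = Hitters)  # full model
      step(null.lm, scope = formula(full.lm), direction = "forward")
      step(full.lm, direction = "backward")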

  • 17/50

    Measure of performance

    Common choices:
      s^2 = \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat y_i)^2 = \frac{RSS}{n-p-1},   RSS = D(\hat\beta)
      R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}
          = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}
          = \left( \frac{\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar{\hat y})}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2}} \right)^2
    One can show that 0 \le R^2 \le 1; R^2 close to 1 indicates good performance.
    These measures do not take overfitting into account. We will look at this later.

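    A sketch computing s^2 and R^2 directly from the Advertising fit fit2.lm above (the data frame Advertising with column Sales is assumed, as in the earlier R output):

      RSS <- sum(residuals(fit2.lm)^2)
      TSS <- sum((Advertising$Sales - mean(Advertising$Sales))^2)
      n <- nrow(Advertising); p <- 2             # two predictors: TV and Radio
      s2 <- RSS/(n - p - 1)                      # squared residual standard error
      R2 <- 1 - RSS/TSS
      c(sqrt(s2), R2)                            # compare with summary(fit2.lm)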

  • 18/50

    Prediction

    \hat\beta_0, ..., \hat\beta_p give the prediction
      \hat Y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p
    This is an approximation to the assumed model
      f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
    \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p is itself only an approximation to the true model f(x) = E[Y|x].
    Most theoretical results rely on assuming that the true f is linear.


  • 19/50

    Confidence and prediction intervals

    Confidence interval for \beta_j:   \hat\beta_j \pm t_{\alpha/2; n-p-1} SE(\hat\beta_j)
    > confint(fit2.lm)
                     2.5 %     97.5 %
    (Intercept) 2.34034299 3.50185683
    TV          0.04301292 0.04849671
    Radio       0.17213877 0.20384969
    Confidence interval for E[Y|x] = x^T\beta:   x^T\hat\beta \pm t_{\alpha/2; n-p-1} SE(x^T\hat\beta)
    Prediction interval for a new observation:
    > newdata <- data.frame(...)
    > predict(fit2.lm, newdata, interval = "predict")
           fit      lwr      upr
    1 11.25647 7.929616 14.58332
    These intervals rely on the assumed model being the true model.

  • 20/50

    Qualitative explanatory variables

    So far: Assumed the explanatory variables are quantitative.
    Example: Credit data set

    [Figure: boxplots of credit card balance for males and females]

    How to do regression with qualitative data?
    Assume first one explanatory variable with two categories. Define
      x_i = 1 if individual i is female, 0 if individual i is male
    Assume the model
      Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
          = \beta_0 + \beta_1 + \varepsilon_i   if i is female
            \beta_0 + \varepsilon_i             if i is male


  • 21/50

    Qualitative explanatory variables - cont.

    Example: Credit data set

    [Figure: boxplots of credit card balance for the groups African American, Asian and Caucasian]

    Explanatory variable with three categories. Define
      x_{i1} = 1 if individual i is Asian, 0 otherwise
      x_{i2} = 1 if individual i is Caucasian, 0 otherwise
    Assume the model
      Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i
          = \beta_0 + \beta_1 + \varepsilon_i   if i is Asian
            \beta_0 + \beta_2 + \varepsilon_i   if i is Caucasian
            \beta_0 + \varepsilon_i             if i is African American

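    R constructs the dummy variables above automatically for factors; a short sketch showing the coding with model.matrix() (Credit from the ISLR package is assumed):

      library(ISLR)
      head(model.matrix(~ Ethnicity, data = Credit))
      # columns: (Intercept), EthnicityAsian, EthnicityCaucasian;
      # African American is the baseline, captured by the intercept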

  • 22/50

    Regression with qualitative variables in R

    > class(Credit$Student)
    [1] "factor"
    > class(Credit$Ethnicity)
    [1] "factor"
    > fit.lm <- lm(Balance ~ Student + Ethnicity, data = Credit)
    > summary(fit.lm)
    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
    (Intercept)         490.776     45.411  10.807  < 2e-16 ***
    StudentYes          398.221     74.391   5.353 1.47e-07 ***
    EthnicityAsian      -29.216     62.899  -0.464    0.643
    EthnicityCaucasian   -6.297     54.817  -0.115    0.909

    Residual standard error: 445.6 on 396 degrees of freedom
    Multiple R-squared: 0.06768,  Adjusted R-squared: 0.06062
    F-statistic: 9.583 on 3 and 396 DF,  p-value: 4.025e-06


  • 23/50

    Quantitative and qualitative variables in R

    > fit.lm2 <- lm(Balance ~ Age + Cards + Education + Income + Limit + Rating + Student + Ethnicity, data = Credit)
    > summary(fit.lm2)
    Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
    (Intercept)        -487.65045   35.22809 -13.843  < 2e-16 ***
    Age                  -0.59933    0.29304  -2.045   0.0415 *
    Cards                18.06541    4.33008   4.172 3.72e-05 ***
    Education            -1.16552    1.59422  -0.731   0.4652
    Income               -7.79950    0.23395 -33.338  < 2e-16 ***
    Limit                 0.19394    0.03258   5.953 5.86e-09 ***
    Rating                1.08888    0.48785   2.232   0.0262 *
    StudentYes          426.10483   16.61371  25.648  < 2e-16 ***
    EthnicityAsian       15.01876   14.00721   1.072   0.2843
    EthnicityCaucasian    9.24342   12.17138   0.759   0.4480

    Residual standard error: 98.77 on 390 degrees of freedom
    Multiple R-squared: 0.9549,  Adjusted R-squared: 0.9538
    F-statistic: 917.2 on 9 and 390 DF,  p-value: < 2.2e-16


  • 24/50

    Extensions of the linear model

    We have seen interactions earlier:
      Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
        = \beta_0 + (\beta_1 + \beta_3 x_2) x_1 + \beta_2 x_2 + \varepsilon
        = \beta_0 + \beta_1 x_1 + (\beta_2 + \beta_3 x_1) x_2 + \varepsilon
    Non-linear in x, linear in \beta!
    Variable selection with interactions:
      Hierarchical principle: If an interaction term is included, also include the corresponding main effects, even if they are not significant.
      This gives an easier interpretation of the model.


  • 25/50

    Interaction between qualitative and quantitative variables

    Credit data: Want to predict balance from income (quantitative) and Student (qualitative).

    > fit.lm3 <- lm(Balance ~ Income * Student, data = Credit)

  • 26/50

    Potential problems

    Non-linearities

    Correlations between noise terms

    Non-constant variance of noise terms (heteroscedasticity)

    Outliers

    Observations with high influence

    Collinearity


  • 27/50

    Non-linear relations

    Auto dataset

    [Figure: scatter plot of miles per gallon against horsepower]

    Is a linear model reasonable?
    Alternative: Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_q x^q + \varepsilon
    Plot: q = 1, 2, 5

    [Figure: miles per gallon against horsepower with fitted curves of degree 1 (linear), 2 and 5]

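    A sketch of the polynomial fits in the plot via poly() (Auto from the ISLR package is assumed; raw = TRUE gives the parametrization above):

      library(ISLR)
      fit1 <- lm(mpg ~ horsepower, data = Auto)                       # q = 1
      fit2 <- lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto)  # q = 2
      fit5 <- lm(mpg ~ poly(horsepower, 5, raw = TRUE), data = Auto)  # q = 5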

  • 28/50

    Correlation between noise terms

    Standard assumption:
      Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,   \varepsilon_1, ..., \varepsilon_n independent
    What if there is dependence between the noise terms?
      \hat\beta = (X^T X)^{-1} X^T y
      We still have E[Y] = X\beta and thereby E[\hat\beta] = \beta
      However V[\hat\beta] \ne \sigma^2 (X^T X)^{-1}, typically with larger variances
      This will influence inference
    Necessary to change the model


  • 29/50

    Non-constant variance

    We often assume V[\varepsilon_i] = \sigma^2, the same for all observations.

    [Figure: residuals against fitted values for the response Y (funnel shape, increasing variance) and for the response log(Y) (roughly constant variance)]

    Transformations can typically help!


  • 30/50

    Outliers

    An outlier is a y-value which is far from the predicted value.
    Easiest to identify by residual plots (linreg_outlier.R)

    [Figure: simulated data with fitted lines for all data and with the outlier excluded, together with the residual plot that reveals the outlier]


  • 31/50

    Observations with high influence

    Linear model: Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i
    LS estimate: \hat\beta = (X^T X)^{-1} X^T y
    Prediction: \hat Y = X\hat\beta = X (X^T X)^{-1} X^T Y = PY, so that
      \hat y_i = \sum_{j=1}^n P_{ij} y_j
    P_{ii} (the leverage) says how much influence y_i has on \hat y_i.
    We do not want this influence to be too large (→ overfitting).
    Large influence is typical for unusual x-values.

    [Figure: a high-leverage observation (41) in plots of Y against X and X2 against X1, and studentized residuals against leverage]


  • 32/50

    Collinearity - two variables

    Credit data: Some x-variables are highly correlated

    [Figure: scatter plots of Age against Limit (weak correlation) and Rating against Limit (strong correlation)]

    > fit1.lm <- lm(Balance ~ Age + Limit, data = Credit)
    > summary(fit1.lm)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -2.928e+02  2.668e+01  -10.97
    ...
    > fit2.lm <- lm(Balance ~ Rating + Limit, data = Credit)
    > summary(fit2.lm)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -377.53680   45.25418  -8.343 1.21e-15 ***
    Limit          0.02451    0.06383   0.384   0.7012
    Rating         2.20167    0.95229   2.312   0.0213 *

    Collinearity between two variables can be identified through correlation matrices.


  • 33/50

    Collinearity - several variables

    Collinearity between more than two variables is more problematic.
    The problem can be identified through the variance inflation factor (VIF), defined by
      VIF(\hat\beta_j) = \frac{V_{full}(\hat\beta_j)}{V_{single}(\hat\beta_j)}
    where V_{full}(\hat\beta_j) is the variance of the estimate based on the model with all explanatory variables, and V_{single}(\hat\beta_j) is the variance of the estimate based on x_j as the only explanatory variable.
    VIF(\hat\beta_j) \ge 1; a low value indicates little collinearity.
    > fit.full <- lm(Balance ~ ., data = Credit)
    > library(car)
    > vif(fit.full)
                     GVIF Df GVIF^(1/(2*Df))
    X            1.030358  1        1.015066
    Income       2.787231  1        1.669500
    Limit      234.064316  1       15.299161
    Rating     235.887178  1       15.358619
    Cards        1.449767  1        1.204063
    Age          1.054739  1        1.027005
    Education    1.019588  1        1.009747
    Gender       1.019885  1        1.009894
    Student      1.032245  1        1.015994
    Married      1.045300  1        1.022399
    Ethnicity    1.040571  2        1.009992


  • 34/50

    Least squares

    Data {(x_1, y_1), ..., (x_n, y_n)}
    Least squares estimate:
      \hat\beta = \underbrace{(X^T X)^{-1}}_{p \times p} \, \underbrace{X^T Y}_{p \times 1}
    Calculating X^T X = \sum_{i=1}^n x_i x_i^T is O(np^2)
    Inverting X^T X is O(p^3)
    Calculating X^T Y is O(np)
    In practice the estimate is typically computed with a Gram-Schmidt (QR) procedure rather than an explicit inverse.


  • 35/50

    Recursive methods

    Define W_n = \sum_{i=1}^n x_i x_i^T and V_n = W_n^{-1}. We have
      W_{n+1} = W_n + x_{n+1} x_{n+1}^T
      V_{n+1} = V_n - h_n V_n x_{n+1} x_{n+1}^T V_n   (Sherman-Morrison)
      h_n = \frac{1}{1 + x_{n+1}^T V_n x_{n+1}}
    and
      \hat\beta_{n+1} = \hat\beta_n + k_n \underbrace{(y_{n+1} - x_{n+1}^T \hat\beta_n)}_{prediction\ error}
      k_n = h_n V_n x_{n+1}
    Sum of squares Q_n = \sum_{i=1}^n (y_i - \hat y_i)^2:
      Q_{n+1} = Q_n + h_n (y_{n+1} - x_{n+1}^T \hat\beta_n)^2

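    A sketch implementing the recursion above in plain R; V is initialized from the first p observations (assumed to give an invertible W_p) and then updated one observation at a time. It agrees with lm() on the full data up to numerical error:

      rls <- function(X, y) {
        p <- ncol(X)
        V <- solve(t(X[1:p, ]) %*% X[1:p, ])     # V_p = W_p^{-1}
        beta <- V %*% t(X[1:p, ]) %*% y[1:p]
        for (i in (p + 1):nrow(X)) {
          x <- X[i, ]
          h <- 1 / (1 + drop(t(x) %*% V %*% x))  # h_n
          k <- h * V %*% x                       # k_n = h_n V_n x_{n+1}
          beta <- beta + k * (y[i] - drop(t(x) %*% beta))
          V <- V - h * V %*% x %*% t(x) %*% V    # Sherman-Morrison update
        }
        beta
      }
      # e.g. rls(X, y) with X, y as in the least squares sketch earlier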

  • 36/50

    Least squares and maximum likelihood

    Assume now
      y_i = x_i^T \beta + \varepsilon_i,   \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)
    Likelihood = density of the observations:
      L(\theta) = f(y|\theta) \overset{ind}{=} \prod_{i=1}^n f(y_i|\theta)
                = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(y_i - x_i^T\beta)^2\right)
                = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T\beta)^2\right)
    Maximum likelihood principle: \hat\theta = \arg\max_\theta L(\theta)
    Maximization with respect to \beta is equivalent to minimizing
      D(\beta) = \sum_{i=1}^n (y_i - x_i^T\beta)^2
    that is, least squares is equivalent to maximum likelihood under Gaussian noise.
    Bonus: Estimate for \sigma^2:
      \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T\hat\beta)^2


  • 37/50

    Advantages of maximum likelihood

    Possible with other models for the noise:
      t distribution, allowing for some large noise terms
    Possible for other outcomes:
      Gamma distribution for positive continuous data
      Binomial distribution for binary data
      Poisson distribution for count data


  • 38/50

    General ML theory

    Maximum likelihood principle:
      \hat\theta = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta)
    Typically found as the solution of
      \frac{\partial}{\partial\theta} \log L(\theta) = 0
    Note: L involves products, log L involves sums; derivatives are easier for the latter.
    This is only guaranteed to give a local maximum.
    As n \to \infty, \log L(\theta) will in many cases converge towards a concave function, so a global maximum can be obtained.
    Important quantity:
      J(\hat\theta) = -\frac{\partial^2}{\partial\theta\,\partial\theta^T} \log L(\theta)\Big|_{\theta=\hat\theta}   (Fisher's observed information matrix)
      \hat\theta \approx N(\theta, J(\hat\theta)^{-1})
    Note: Exact results are available for linear regression!

  • 39/50

    ML and inference

    Confidence intervals: \hat\theta_r \pm z_{\alpha/2}\,\mathrm{std.err}(\hat\theta_r)
    Testing H_0: \theta_r = a:
      t = \frac{\hat\theta_r - a}{\mathrm{std.err}(\hat\theta_r)} \overset{H_0}{\approx} N(0, 1),   P-value \approx 2\Phi(-|t|)
    Testing H_0: g_j(\theta) = 0, j = 1, ..., q, with \hat\theta_0 the ML estimate under H_0:
      w = D = 2[\log L(\hat\theta) - \log L(\hat\theta_0)] \overset{H_0}{\approx} \chi^2_q   (likelihood ratio statistic)
      P-value \approx \Pr(\chi^2_q > w)


  • 40/50

    Binary variables

    Assume y \sim \mathrm{Binom}(n, \pi). Then
      \log L(\pi) = \mathrm{constant} + y \log(\pi) + (n - y)\log(1 - \pi)
      \frac{\partial}{\partial\pi}\log L(\pi) = \frac{y}{\pi} - \frac{n - y}{1 - \pi}
      \hat\pi_{ML} = \frac{y}{n}
      \frac{\partial^2}{\partial\pi^2}\log L(\pi) = -\frac{y}{\pi^2} - \frac{n - y}{(1 - \pi)^2}
      \frac{\partial^2}{\partial\pi^2}\log L(\hat\pi) = -\frac{n}{\hat\pi} - \frac{n}{1 - \hat\pi} = -\frac{n}{\hat\pi(1 - \hat\pi)}
      \mathrm{Var}[\hat\pi] \approx \frac{\hat\pi(1 - \hat\pi)}{n}


  • 41/50

    Binary variables - two groups

    Assume y_j \sim \mathrm{Binom}(n_j, \pi_j), j = 1, 2.
    Test H_0: \pi_1 = \pi_2 = \pi
    Under H_0: \hat\pi = (y_1 + y_2)/(n_1 + n_2)
    Under the alternative: \hat\pi_j = y_j/n_j
      \log L(\pi_1, \pi_2) = \mathrm{constant} + \sum_{j=1}^{2} [y_j \log(\pi_j) + (n_j - y_j)\log(1 - \pi_j)]
      D = 2[\log L(\hat\pi_1, \hat\pi_2) - \log L(\hat\pi, \hat\pi)] = D_0 - D_1   (deviance)
      D_0 = -2\log L(\hat\pi, \hat\pi),   D_1 = -2\log L(\hat\pi_1, \hat\pi_2)
    Under H_0: D \approx \chi^2_1


  • 42/50

    Example

    Data from a Brazilian bank
    Response: satisfaction (low/high)
    Groups: young/old

    satisfaction\group    young     old   total
    low                      84      34     118
    high                    225     157     382
    total                   309     191     500
    π̂                     0.729   0.822   0.764
    std.err(π̂)            0.025   0.026   0.019

    D = 5.96, P-value 0.015

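    A sketch reproducing the deviance test from the table (y = number of "high" answers, n = group sizes for young and old):

      y <- c(225, 157); n <- c(309, 191)
      logL <- function(p) sum(y*log(p) + (n - y)*log(1 - p))
      D <- 2*(logL(y/n) - logL(sum(y)/sum(n)))   # 2[logL(pi1,pi2) - logL(pi,pi)]
      D                                          # close to the D = 5.96 above
      1 - pchisq(D, df = 1)                      # P-value about 0.015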

  • 43/50

    Logistic regression

    Assume y_i \sim \mathrm{Binom}(1, \pi_i), i = 1, ..., n.
    We want to relate \pi_i = \pi(x_i) to some explanatory variables x_i.
    Linear predictor:
      \eta(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}
    Logistic regression:
      \pi(x) = \frac{\exp(\eta(x))}{1 + \exp(\eta(x))}   (logistic function)
             = \frac{1}{1 + \exp(-\eta(x))}   (sigmoid function)
             = \mathrm{sigmoid}(\eta(x))


  • 44/50

    Logistic function

    [Figure: the logistic function sigmoid(η(x)) plotted for η(x) = 2 − 2x, η(x) = 2 + 3x and η(x) = 0 + x]


  • 45/50

    Brazilian data

    Earlier: Compared two groups, young or old

    Age is a numeric variable and could be used as an explanatory variable directly.

    Brazilian_logist_reg.R (a sketch of such an analysis follows below)

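    A hedged sketch of what a script like Brazilian_logist_reg.R might do; the data frame brazil and its columns satisfaction (coded 0/1 or as a factor) and age are hypothetical stand-ins:

      fit.glm <- glm(satisfaction ~ age, family = binomial, data = brazil)
      summary(fit.glm)
      predict(fit.glm, data.frame(age = 40), type = "response")  # estimated P(high) at age 40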

  • 46/50

    Generalized linear models

    Logistic regression:
      Y \sim \mathrm{Binom}(1, \pi(x))
      E[Y|x] = \pi(x) = \frac{e^{\eta(x)}}{1 + e^{\eta(x)}}
      g(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)
      g(E[Y|x]) = \eta(x) = x^T\beta
    Special case of
      Y \sim f(\mu(x)),   f in the exponential family
      \eta(x) = g(\mu(x)) = x^T\beta
      E[Y|x] = \mu(x) = g^{-1}(\eta(x))
    Generalized linear models
      Include linear regression, logistic regression, Poisson regression, ...
      Topic of STK3100


  • 47/50

    K-nearest neighbor regression

    Linear regression:
      Simple to fit
      Simple interpretation
      Simple to perform different tests
      Strong assumptions on the model
    Non-parametric methods (machine learning):
      Do not assume any explicit form
    The K-nearest neighbor method:
      \hat f(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i
    where N_0 \subset \{x_1, ..., x_n\} contains the K points nearest to x_0.

    [Figure: KNN regression surfaces over (x1, x2) for a small and a large K]

    Choice of K: trade-off between bias and variance (a minimal sketch follows below).
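    A minimal sketch of K-nearest-neighbor regression in plain R (one-dimensional x and a simulated sine curve, for simplicity):

      knn.reg <- function(x, y, x0, K) {
        idx <- order(abs(x - x0))[1:K]           # indices of the K nearest points
        mean(y[idx])                             # average of their responses
      }
      set.seed(1)
      x <- runif(100); y <- sin(2*pi*x) + rnorm(100, sd = 0.2)
      knn.reg(x, y, x0 = 0.5, K = 5)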

  • 48/50

    Parametric or non-parametric?

    If the parametric model is close to the truth: the parametric method is better (smaller variance, small bias).

    [Figure: simulated data with linear and KNN fits, and mean squared error against 1/K compared with the linear fit]


  • 49/50

    Parametric or non-parametric?

    If the parametric model is very wrong: the non-parametric method is better (smaller bias).

    [Figure: two non-linear true regression functions with linear and KNN fits, and mean squared error against 1/K for each]


  • 50/50

    Parametric or non-parametric - high dimension

    For one explanatory variable: many observations can give good results for non-parametric methods.
    For many explanatory variables: the K nearest x_i's to x_0 will typically be far away from x_0. This gives large bias, since the method is based on f(x_0) \approx f(x_i) for x_i close to x_0.

    [Figure: mean squared error against 1/K for p = 1, 2, 3, 4, 10 and 20 explanatory variables]
