
  • 1/50

    Chapter 2 - linear methods
    Linear regression

    Logistic regression

    Geir Storvik

    January 25, 2021


  • 2/50

    Lectures

    Course web-page: 3 hours lectures

    Schedule: 4 hours lectures

    New plan: Only lectures 14.15-15.00 on Wednesdays

    We might use the extra hour later!


  • 3/50

    Linear regression

    What is linear regression? Some repetition from STK1110; see chapter 12 in Devore & Berk.

    Properties: what can be done with the linear model? Challenges/weaknesses.

    Many of these are common with other methods.


  • 4/50

    Prediction - Advertising data

    Response: Sales of a product in 200 different markets (sales)
    Explanatory variables:
      Advertisement budget for TV (TV)
      Advertisement budget for radio (radio)
      Advertisement budget for newspapers (newspaper)

    [Figure: scatter plots of sales against the TV, radio and newspaper budgets]

    1 Some of the figures are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

    Questions:
    Is there a relationship between advertising and sales?
    How strong is this relationship?
    Which medium has the strongest influence?
    How precisely can we estimate the effects?
    How precisely can we predict future sales?
    Is there a linear relationship?
    Is there some synergy/interaction between the different media?


  • 5/50

    Linear regression

    Data (x_1, y_1), ..., (x_n, y_n)
    Model: Assume
      Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i,   \varepsilon_i \overset{ind}{\sim} (0, \sigma^2)   (*)
    Matrix form:
      \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
      = \begin{pmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ & & \vdots & & \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix}
        \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}
      + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
      Y = X\beta + \varepsilon
    Least squares estimate (also ML if \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)):
      \hat\beta = (X^T X)^{-1} X^T Y
    Prediction at a new point x^* = (x_1^*, ..., x_p^*):
      \hat y^* = \hat\beta_0 + \hat\beta_1 x_1^* + \cdots + \hat\beta_p x_p^*

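    A minimal R sketch (simulated data; the coefficients 1, 2, -0.5 are arbitrary choices, not from the slides) showing that the closed-form estimate agrees with lm():

      set.seed(1)
      n <- 100
      x1 <- rnorm(n); x2 <- rnorm(n)
      y <- 1 + 2*x1 - 0.5*x2 + rnorm(n)          # true beta = (1, 2, -0.5)
      X <- cbind(1, x1, x2)                      # design matrix with intercept
      solve(t(X) %*% X, t(X) %*% y)              # (X^T X)^{-1} X^T y
      coef(lm(y ~ x1 + x2))                      # same estimates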

  • 6/50

    Random vectors

    If z_1, ..., z_p are random variables, we say that z = (z_1, ..., z_p) is a random vector.
    We define the expectation and covariance matrix by
      E[z] = \begin{pmatrix} E[z_1] \\ E[z_2] \\ \vdots \\ E[z_p] \end{pmatrix},
      V[z] = \begin{pmatrix} Var[z_1] & Cov[z_1, z_2] & \cdots & Cov[z_1, z_p] \\ Cov[z_2, z_1] & Var[z_2] & \cdots & Cov[z_2, z_p] \\ \vdots & \vdots & \ddots & \vdots \\ Cov[z_p, z_1] & Cov[z_p, z_2] & \cdots & Var[z_p] \end{pmatrix}
    Rules:
      E[Az + b] = A E[z] + b
      V[Az + b] = A V[z] A^T


  • 7/50

    Properties - linear regression

    Estimate: \hat\beta = (X^T X)^{-1} X^T Y
    If (*) is true,
      E[\hat\beta] = \beta,   V[\hat\beta] = \sigma^2 (X^T X)^{-1}
    If also \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2):
      Test of H_0: \beta_j = 0:   T = \frac{\hat\beta_j}{SE(\hat\beta_j)} \overset{H_0}{\sim} t_{n-p-1}
      Test of H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0:   F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)} \overset{H_0}{\sim} F_{p,n-p-1}
      where RSS = \sum_{i=1}^n (y_i - \hat y_i)^2 < TSS = \sum_{i=1}^n (y_i - \bar y)^2
    These results hold approximately also without normality if n \gg p.


  • 8/50

    Geometric interpretation

      \hat\beta = (X^T X)^{-1} X^T Y,   \hat y_i = x_i^T \hat\beta,   \hat Y = (\hat y_1, ..., \hat y_n)
      \hat Y = X\hat\beta = \underbrace{X (X^T X)^{-1} X^T}_{P}\, Y = PY,   P symmetric
      P^2 = X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T = P   (projection matrix)
      Y - \hat Y = (I - P) Y
      (Y - \hat Y)^T \hat Y = Y^T (I - P) P Y = 0   (orthogonality)

    [Figure: Y projected onto the column space C(X); the residual Y − Ŷ is orthogonal to C(X)]

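    A small numerical check of the projection properties above (simulated X and Y; any full-rank design works):

      set.seed(1)
      X <- cbind(1, rnorm(20), rnorm(20))
      Y <- rnorm(20)
      P <- X %*% solve(t(X) %*% X) %*% t(X)
      max(abs(P %*% P - P))                      # ~ 0, so P^2 = P
      Yhat <- P %*% Y
      sum((Y - Yhat) * Yhat)                     # ~ 0, residuals orthogonal to the fit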

  • 9/50

    Advertising data

    > fit.lm <- lm(Sales ~ TV + Radio + Newspaper, data = Advertising)
    > summary(fit.lm)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.938889   0.311908   9.422
    ...

  • 10/50

    Advertising data

    Clearly significant that at least one of the explanatory variables is useful for predicting the response.
    Are all explanatory variables important?
    Newspaper seems to be less important.
    > fit2.lm <- lm(Sales ~ TV + Radio, data = Advertising)
    > summary(fit2.lm)
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.92110    0.29449   9.919
    ...

  • 11/50

    Comparison of models

    Topic within chapter 3. Here: some simple approaches.
    Assume we want to test
      H_0: \beta_{i_1} = \beta_{i_2} = \cdots = \beta_{i_q} = 0
    Let RSS_0 = \sum_{i=1}^n (y_i - \hat y_i)^2 where \hat y_i is computed under H_0, and RSS similarly for the full model. Then
      F = \frac{(RSS_0 - RSS)/q}{RSS/(n-p-1)} \overset{H_0}{\sim} F_{q,n-p-1}
    Example: H_0: \beta_3 = 0, q = 1
    > RSS <- sum(residuals(fit.lm)^2)
    > RSS0 <- sum(residuals(fit2.lm)^2)
    > Fobs <- ((RSS0 - RSS)/1)/(RSS/196)
    > Fobs
    [1] 0.03122805
    > 1 - pf(Fobs, 1, 196)
    [1] 0.8599151
    > anova(fit.lm, fit2.lm)
    Analysis of Variance Table
    Model 1: Sales ~ TV + Radio + Newspaper
    Model 2: Sales ~ TV + Radio
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1    196 556.83
    2    197 556.91 -1 -0.088717 0.0312 0.8599


  • 12/50

    Two tests when q = 1

    Test of H_0: \beta_j = 0:
      T = \frac{\hat\beta_j}{SE(\hat\beta_j)} \overset{H_0}{\sim} t_{n-p-1}
      F = \frac{(RSS_0 - RSS)/1}{RSS/(n-p-1)} \overset{H_0}{\sim} F_{1,n-p-1}
    Same test, since F = T^2 and
      T \sim t_{n-p-1}  \Rightarrow  T^2 \sim F_{1,n-p-1}
    Example: F = 0.03122805
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 2.938889   0.311908   9.422
    ...

  • 13/50

    Interactions

    Alternative model for the Advertising data:
      Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
    > fit3.lm <- lm(Sales ~ TV * Radio, data = Advertising)
    > summary(fit3.lm)
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 6.750e+00  2.479e-01  27.233
    ...

  • 14/50

    What is linearity?

    Model with interactions:
      Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
    The model is not linear in the x's.
    The model is linear in the \beta's.
    The theory of linear regression requires linearity in the \beta's.


  • 15/50

    What if p is large?

    Example: Hitters
    Response: Salary, p = 19 explanatory variables
    > fit.lm <- lm(Salary ~ ., data = Hitters)
    > summary(fit.lm)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  163.10359   90.77854   1.797 0.073622 .
    AtBat         -1.97987    0.63398  -3.123 0.002008 **
    Hits           7.50077    2.37753   3.155 0.001808 **
    HmRun          4.33088    6.20145   0.698 0.485616
    Runs          -2.37621    2.98076  -0.797 0.426122
    RBI           -1.04496    2.60088  -0.402 0.688204
    Walks          6.23129    1.82850   3.408 0.000766 ***
    Years         -3.48905   12.41219  -0.281 0.778874
    CAtBat        -0.17134    0.13524  -1.267 0.206380
    CHits          0.13399    0.67455   0.199 0.842713
    CHmRun        -0.17286    1.61724  -0.107 0.914967
    CRuns          1.45430    0.75046   1.938 0.053795 .
    CRBI           0.80771    0.69262   1.166 0.244691
    CWalks        -0.81157    0.32808  -2.474 0.014057 *
    LeagueN       62.59942   79.26140   0.790 0.430424
    DivisionW   -116.84925   40.36695  -2.895 0.004141 **
    PutOuts        0.28189    0.07744   3.640 0.000333 ***
    Assists        0.37107    0.22120   1.678 0.094723 .
    Errors        -3.36076    4.39163  -0.765 0.444857
    NewLeagueN   -24.76233   79.00263  -0.313 0.754218
    ---
    Residual standard error: 315.6 on 243 degrees of freedom
      (59 observations deleted due to missingness)
    Multiple R-squared: 0.5461,  Adjusted R-squared: 0.5106
    F-statistic: 15.39 on 19 and 243 DF,  p-value: < 2.2e-16

    How to choose explanatory variables?


  • 16/50

    Variable selection

    The number of possibilities grows fast with p:
      p = 3 gives 2^3 = 8 possible models
      p = 30 gives 2^30 = 1 073 741 824 possible models!
    Forward selection
      Start with the null model Y = \beta_0 + \varepsilon
      Add the variable that gives the best improvement
      Continue as long as you obtain a significant improvement
    Backward selection
      Start with the full model Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon
      Remove the variable that gives the smallest deterioration
      Continue until you get a non-significant deterioration
    Mixed selection
      Combination of forward and backward selection
    We will come back to this in chapter 3; a sketch with R's step() follows below.

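    A sketch of forward and backward selection in R with step(); note that step() uses AIC rather than the significance-based rule above. The ISLR package as the source of Hitters is an assumption:

      library(ISLR)                              # assumed source of the Hitters data
      Hitters <- na.omit(Hitters)
      null.lm <- lm(Salary ~ 1, data = Hitters)  # null model
      full.lm <- lm(Salary ~ ., data = Hitters)  # full model
      step(null.lm, scope = formula(full.lm), direction = "forward")
      step(full.lm, direction = "backward")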

  • 17/50

    Measure of performance

    Common choices:
      s^2 = \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat y_i)^2 = \frac{RSS}{n-p-1},   RSS = D(\hat\beta)
      R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}
          = 1 - \frac{\sum_{i=1}^n (y_i - \hat y_i)^2}{\sum_{i=1}^n (y_i - \bar y)^2}
          = \left( \frac{\sum_{i=1}^n (y_i - \bar y)(\hat y_i - \bar{\hat y})}{\sqrt{\sum_{i=1}^n (y_i - \bar y)^2 \sum_{i=1}^n (\hat y_i - \bar{\hat y})^2}} \right)^2
    One can show that 0 \le R^2 \le 1; R^2 close to 1 indicates good performance.
    These measures do not take overfitting into account. We will look at this later.

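    A sketch computing s^2 and R^2 directly from the Advertising fit fit2.lm above (the data frame Advertising with column Sales is assumed, as in the earlier R output):

      RSS <- sum(residuals(fit2.lm)^2)
      TSS <- sum((Advertising$Sales - mean(Advertising$Sales))^2)
      n <- nrow(Advertising); p <- 2             # two predictors: TV and Radio
      s2 <- RSS/(n - p - 1)                      # squared residual standard error
      R2 <- 1 - RSS/TSS
      c(sqrt(s2), R2)                            # compare with summary(fit2.lm)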

  • 18/50

    Prediction

    \hat\beta_0, ..., \hat\beta_p give the prediction
      \hat Y = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p
    This is an approximation to the assumed model
      f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
    \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p is itself only an approximation to the true model f(x) = E[Y|x].
    Most theoretical results rely on assuming that the true f is linear.


  • 19/50

    Confidence and prediction intervals

    Confidence interval for \beta_j:   \hat\beta_j \pm t_{\alpha/2; n-p-1} SE(\hat\beta_j)
    > confint(fit2.lm)
                     2.5 %     97.5 %
    (Intercept) 2.34034299 3.50185683
    TV          0.04301292 0.04849671
    Radio       0.17213877 0.20384969
    Confidence interval for E[Y|x] = x^T\beta:   x^T\hat\beta \pm t_{\alpha/2; n-p-1} SE(x^T\hat\beta)
    Prediction interval for a new observation:
    > newdata <- data.frame(...)
    > predict(fit2.lm, newdata, interval = "predict")
           fit      lwr      upr
    1 11.25647 7.929616 14.58332
    These intervals rely on the assumed model being the true model.

  • 20/50

    Qualitative explanatory variables

    So far: Assumed the explanatory variables are quantitative.
    Example: Credit data set

    [Figure: boxplots of credit card balance for males and females]

    How to do regression with qualitative data?
    Assume first one explanatory variable with two categories. Define
      x_i = 1 if individual i is female, 0 if individual i is male
    Assume the model
      Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
          = \beta_0 + \beta_1 + \varepsilon_i   if i is female
            \beta_0 + \varepsilon_i             if i is male


  • 21/50

    Qualitative explanatory variables - cont.

    Example: Credit data set

    [Figure: boxplots of credit card balance for the groups African American, Asian and Caucasian]

    Explanatory variable with three categories. Define
      x_{i1} = 1 if individual i is Asian, 0 otherwise
      x_{i2} = 1 if individual i is Caucasian, 0 otherwise
    Assume the model
      Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i
          = \beta_0 + \beta_1 + \varepsilon_i   if i is Asian
            \beta_0 + \beta_2 + \varepsilon_i   if i is Caucasian
            \beta_0 + \varepsilon_i             if i is African American

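    R constructs the dummy variables above automatically for factors; a short sketch showing the coding with model.matrix() (Credit from the ISLR package is assumed):

      library(ISLR)
      head(model.matrix(~ Ethnicity, data = Credit))
      # columns: (Intercept), EthnicityAsian, EthnicityCaucasian;
      # African American is the baseline, captured by the intercept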

  • 22/50

    Regression with qualitative variables in R

    > class(Credit$Student)
    [1] "factor"
    > class(Credit$Ethnicity)
    [1] "factor"
    > fit.lm <- lm(Balance ~ Student + Ethnicity, data = Credit)
    > summary(fit.lm)
    Coefficients:
                       Estimate Std. Error t value Pr(>|t|)
    (Intercept)         490.776     45.411  10.807  < 2e-16 ***
    StudentYes          398.221     74.391   5.353 1.47e-07 ***
    EthnicityAsian      -29.216     62.899  -0.464    0.643
    EthnicityCaucasian   -6.297     54.817  -0.115    0.909

    Residual standard error: 445.6 on 396 degrees of freedom
    Multiple R-squared: 0.06768,  Adjusted R-squared: 0.06062
    F-statistic: 9.583 on 3 and 396 DF,  p-value: 4.025e-06


  • 23/50

    Quantitative and qualitative variables in R

    > fit.lm2 <- lm(Balance ~ Age + Cards + Education + Income + Limit + Rating + Student + Ethnicity, data = Credit)
    > summary(fit.lm2)
    Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
    (Intercept)        -487.65045   35.22809 -13.843  < 2e-16 ***
    Age                  -0.59933    0.29304  -2.045   0.0415 *
    Cards                18.06541    4.33008   4.172 3.72e-05 ***
    Education            -1.16552    1.59422  -0.731   0.4652
    Income               -7.79950    0.23395 -33.338  < 2e-16 ***
    Limit                 0.19394    0.03258   5.953 5.86e-09 ***
    Rating                1.08888    0.48785   2.232   0.0262 *
    StudentYes          426.10483   16.61371  25.648  < 2e-16 ***
    EthnicityAsian       15.01876   14.00721   1.072   0.2843
    EthnicityCaucasian    9.24342   12.17138   0.759   0.4480

    Residual standard error: 98.77 on 390 degrees of freedom
    Multiple R-squared: 0.9549,  Adjusted R-squared: 0.9538
    F-statistic: 917.2 on 9 and 390 DF,  p-value: < 2.2e-16


  • 24/50

    Extensions of the linear model

    We have seen interactions earlier:
      Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
        = \beta_0 + (\beta_1 + \beta_3 x_2) x_1 + \beta_2 x_2 + \varepsilon
        = \beta_0 + \beta_1 x_1 + (\beta_2 + \beta_3 x_1) x_2 + \varepsilon
    Non-linear in x, linear in \beta!
    Variable selection with interactions:
      Hierarchical principle: If an interaction term is included, also include the corresponding main effects, even if they are not significant.
      This gives an easier interpretation of the model.


  • 25/50

    Interaction between qualitative and quantitative variables

    Credit data: Want to predict balance from income (quantitative) and Student (qualitative).

    > fit.lm3 <- lm(Balance ~ Income * Student, data = Credit)

  • 26/50

    Potential problems

    Non-linearities

    Correlations between noise terms

    Non-constant variance of noise terms (heteroscedasticity)

    Outliers

    Observations with high influence

    Collinearity


  • 27/50

    Non-linear relations

    Auto dataset

    [Figure: scatter plot of miles per gallon against horsepower]

    Is a linear model reasonable?
    Alternative: Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_q x^q + \varepsilon
    Plot: q = 1, 2, 5

    [Figure: miles per gallon against horsepower with fitted curves of degree 1 (linear), 2 and 5]

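    A sketch of the polynomial fits in the plot via poly() (Auto from the ISLR package is assumed; raw = TRUE gives the parametrization above):

      library(ISLR)
      fit1 <- lm(mpg ~ horsepower, data = Auto)                       # q = 1
      fit2 <- lm(mpg ~ poly(horsepower, 2, raw = TRUE), data = Auto)  # q = 2
      fit5 <- lm(mpg ~ poly(horsepower, 5, raw = TRUE), data = Auto)  # q = 5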

  • 28/50

    Correlation between noise terms

    Standard assumption:
      Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,   \varepsilon_1, ..., \varepsilon_n independent
    What if there is dependence between the noise terms?
      \hat\beta = (X^T X)^{-1} X^T y
      We still have E[Y] = X\beta and thereby E[\hat\beta] = \beta
      However V[\hat\beta] \ne \sigma^2 (X^T X)^{-1}, typically with larger variances
      This will influence inference
    Necessary to change the model


  • 29/50

    Non-constant variance

    We often assume V[\varepsilon_i] = \sigma^2, the same for all observations.

    [Figure: residuals against fitted values for the response Y (funnel shape, increasing variance) and for the response log(Y) (roughly constant variance)]

    Transformations can typically help!


  • 30/50

    Outliers

    An outlier is a y-value which is far from the predicted value.
    Easiest to identify by residual plots (linreg_outlier.R)

    [Figure: simulated data with fitted lines for all data and with the outlier excluded, together with the residual plot that reveals the outlier]


  • 31/50

    Observations with high influence

    Linear model: Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i
    LS estimate: \hat\beta = (X^T X)^{-1} X^T y
    Prediction: \hat Y = X\hat\beta = X (X^T X)^{-1} X^T Y = PY, so that
      \hat y_i = \sum_{j=1}^n P_{ij} y_j
    P_{ii} (the leverage) says how much influence y_i has on \hat y_i.
    We do not want this influence to be too large (→ overfitting).
    Large influence is typical for unusual x-values.

    [Figure: a high-leverage observation (41) in plots of Y against X and X2 against X1, and studentized residuals against leverage]


  • 32/50

    Collinearity - two variables

    Credit data: Some x-variables are highly correlated

    [Figure: scatter plots of Age against Limit (weak correlation) and Rating against Limit (strong correlation)]

    > fit1.lm <- lm(Balance ~ Age + Limit, data = Credit)
    > summary(fit1.lm)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -2.928e+02  2.668e+01  -10.97
    ...
    > fit2.lm <- lm(Balance ~ Rating + Limit, data = Credit)
    > summary(fit2.lm)
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) -377.53680   45.25418  -8.343 1.21e-15 ***
    Limit          0.02451    0.06383   0.384   0.7012
    Rating         2.20167    0.95229   2.312   0.0213 *

    Collinearity between two variables can be identified through correlation matrices.


  • 33/50

    Collinearity - several variables

    Collinearity between more than two variables is more problematic.
    The problem can be identified through the variance inflation factor (VIF), defined by
      VIF(\hat\beta_j) = \frac{V_{full}(\hat\beta_j)}{V_{single}(\hat\beta_j)}
    where V_{full}(\hat\beta_j) is the variance of the estimate based on the model with all explanatory variables, and V_{single}(\hat\beta_j) is the variance of the estimate based on x_j as the only explanatory variable.
    VIF(\hat\beta_j) \ge 1; a low value indicates little collinearity.
    > fit.full <- lm(Balance ~ ., data = Credit)
    > library(car)
    > vif(fit.full)
                     GVIF Df GVIF^(1/(2*Df))
    X            1.030358  1        1.015066
    Income       2.787231  1        1.669500
    Limit      234.064316  1       15.299161
    Rating     235.887178  1       15.358619
    Cards        1.449767  1        1.204063
    Age          1.054739  1        1.027005
    Education    1.019588  1        1.009747
    Gender       1.019885  1        1.009894
    Student      1.032245  1        1.015994
    Married      1.045300  1        1.022399
    Ethnicity    1.040571  2        1.009992


  • 34/50

    Least squares

    Data {(x_1, y_1), ..., (x_n, y_n)}
    Least squares estimate:
      \hat\beta = \underbrace{(X^T X)^{-1}}_{p \times p} \, \underbrace{X^T Y}_{p \times 1}
    Calculating X^T X = \sum_{i=1}^n x_i x_i^T is O(np^2)
    Inverting X^T X is O(p^3)
    Calculating X^T Y is O(np)
    In practice the estimate is typically computed with a Gram-Schmidt (QR) procedure rather than an explicit inverse.


  • 35/50

    Recursive methods

    Define W_n = \sum_{i=1}^n x_i x_i^T and V_n = W_n^{-1}. We have
      W_{n+1} = W_n + x_{n+1} x_{n+1}^T
      V_{n+1} = V_n - h_n V_n x_{n+1} x_{n+1}^T V_n   (Sherman-Morrison)
      h_n = \frac{1}{1 + x_{n+1}^T V_n x_{n+1}}
    and
      \hat\beta_{n+1} = \hat\beta_n + k_n \underbrace{(y_{n+1} - x_{n+1}^T \hat\beta_n)}_{prediction\ error}
      k_n = h_n V_n x_{n+1}
    Sum of squares Q_n = \sum_{i=1}^n (y_i - \hat y_i)^2:
      Q_{n+1} = Q_n + h_n (y_{n+1} - x_{n+1}^T \hat\beta_n)^2

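    A sketch implementing the recursion above in plain R; V is initialized from the first p observations (assumed to give an invertible W_p) and then updated one observation at a time. It agrees with lm() on the full data up to numerical error:

      rls <- function(X, y) {
        p <- ncol(X)
        V <- solve(t(X[1:p, ]) %*% X[1:p, ])     # V_p = W_p^{-1}
        beta <- V %*% t(X[1:p, ]) %*% y[1:p]
        for (i in (p + 1):nrow(X)) {
          x <- X[i, ]
          h <- 1 / (1 + drop(t(x) %*% V %*% x))  # h_n
          k <- h * V %*% x                       # k_n = h_n V_n x_{n+1}
          beta <- beta + k * (y[i] - drop(t(x) %*% beta))
          V <- V - h * V %*% x %*% t(x) %*% V    # Sherman-Morrison update
        }
        beta
      }
      # e.g. rls(X, y) with X, y as in the least squares sketch earlier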

  • 36/50

    Least squares and maximum likelihood

    Assume now
      y_i = x_i^T \beta + \varepsilon_i,   \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)
    Likelihood = density of the observations:
      L(\theta) = f(y|\theta) \overset{ind}{=} \prod_{i=1}^n f(y_i|\theta)
                = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(y_i - x_i^T\beta)^2\right)
                = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T\beta)^2\right)
    Maximum likelihood principle: \hat\theta = \arg\max_\theta L(\theta)
    Maximization with respect to \beta is equivalent to minimizing
      D(\beta) = \sum_{i=1}^n (y_i - x_i^T\beta)^2
    that is, least squares is equivalent to maximum likelihood under Gaussian noise.
    Bonus: Estimate for \sigma^2:
      \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T\hat\beta)^2


  • 37/50

    Advantages of maximum likelihood

    Possible with other models for the noise:
      t distribution, allowing for some large noise terms
    Possible for other outcomes:
      Gamma distribution for positive continuous data
      Binomial distribution for binary data
      Poisson distribution for count data


  • 38/50

    General ML theory

    Maximum likelihood principle:
      \hat\theta = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta)
    Typically found as the solution of
      \frac{\partial}{\partial\theta} \log L(\theta) = 0
    Note: L involves products, log L involves sums; derivatives are easier for the latter.
    This is only guaranteed to give a local maximum.
    As n \to \infty, \log L(\theta) will in many cases converge towards a concave function, so a global maximum can be obtained.
    Important quantity:
      J(\hat\theta) = -\frac{\partial^2}{\partial\theta\,\partial\theta^T} \log L(\theta)\Big|_{\theta=\hat\theta}   (Fisher's observed information matrix)
      \hat\theta \approx N(\theta, J(\hat\theta)^{-1})
    Note: Exact results are available for linear regression!

  • 39/50

    ML and inference

    Confidence intervals: \hat\theta_r \pm z_{\alpha/2}\,\mathrm{std.err}(\hat\theta_r)
    Testing H_0: \theta_r = a:
      t = \frac{\hat\theta_r - a}{\mathrm{std.err}(\hat\theta_r)} \overset{H_0}{\approx} N(0, 1),   P-value \approx 2\Phi(-|t|)
    Testing H_0: g_j(\theta) = 0, j = 1, ..., q, with \hat\theta_0 the ML estimate under H_0:
      w = D = 2[\log L(\hat\theta) - \log L(\hat\theta_0)] \overset{H_0}{\approx} \chi^2_q   (likelihood ratio statistic)
      P-value \approx \Pr(\chi^2_q > w)


  • 40/50

    Binary variables

    Assume y \sim \mathrm{Binom}(n, \pi). Then
      \log L(\pi) = \mathrm{constant} + y \log(\pi) + (n - y)\log(1 - \pi)
      \frac{\partial}{\partial\pi}\log L(\pi) = \frac{y}{\pi} - \frac{n - y}{1 - \pi}
      \hat\pi_{ML} = \frac{y}{n}
      \frac{\partial^2}{\partial\pi^2}\log L(\pi) = -\frac{y}{\pi^2} - \frac{n - y}{(1 - \pi)^2}
      \frac{\partial^2}{\partial\pi^2}\log L(\hat\pi) = -\frac{n}{\hat\pi} - \frac{n}{1 - \hat\pi} = -\frac{n}{\hat\pi(1 - \hat\pi)}
      \mathrm{Var}[\hat\pi] \approx \frac{\hat\pi(1 - \hat\pi)}{n}


  • 41/50

    Binary variables - two groups

    Assume y_j \sim \mathrm{Binom}(n_j, \pi_j), j = 1, 2.
    Test H_0: \pi_1 = \pi_2 = \pi
    Under H_0: \hat\pi = (y_1 + y_2)/(n_1 + n_2)
    Under the alternative: \hat\pi_j = y_j/n_j
      \log L(\pi_1, \pi_2) = \mathrm{constant} + \sum_{j=1}^{2} [y_j \log(\pi_j) + (n_j - y_j)\log(1 - \pi_j)]
      D = 2[\log L(\hat\pi_1, \hat\pi_2) - \log L(\hat\pi, \hat\pi)] = D_0 - D_1   (deviance)
      D_0 = -2\log L(\hat\pi, \hat\pi),   D_1 = -2\log L(\hat\pi_1, \hat\pi_2)
    Under H_0: D \approx \chi^2_1


  • 42/50

    Example

    Data from a Brazilian bank
    Response: satisfaction (low/high)
    Groups: young/old

    satisfaction\group    young     old   total
    low                      84      34     118
    high                    225     157     382
    total                   309     191     500
    π̂                     0.729   0.822   0.764
    std.err(π̂)            0.025   0.026   0.019

    D = 5.96, P-value 0.015

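    A sketch reproducing the deviance test from the table (y = number of "high" answers, n = group sizes for young and old):

      y <- c(225, 157); n <- c(309, 191)
      logL <- function(p) sum(y*log(p) + (n - y)*log(1 - p))
      D <- 2*(logL(y/n) - logL(sum(y)/sum(n)))   # 2[logL(pi1,pi2) - logL(pi,pi)]
      D                                          # close to the D = 5.96 above
      1 - pchisq(D, df = 1)                      # P-value about 0.015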

  • 43/50

    Logistic regression

    Assume y_i \sim \mathrm{Binom}(1, \pi_i), i = 1, ..., n.
    We want to relate \pi_i = \pi(x_i) to some explanatory variables x_i.
    Linear predictor:
      \eta(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}
    Logistic regression:
      \pi(x) = \frac{\exp(\eta(x))}{1 + \exp(\eta(x))}   (logistic function)
             = \frac{1}{1 + \exp(-\eta(x))}   (sigmoid function)
             = \mathrm{sigmoid}(\eta(x))


  • 44/50

    Logistic function

    [Figure: the logistic function sigmoid(η(x)) plotted for η(x) = 2 − 2x, η(x) = 2 + 3x and η(x) = 0 + x]


  • 45/50

    Brazilian data

    Earlier: Compared two groups, young or old

    Age is a numeric variable and could be used as an explanatory variable directly.

    Brazilian_logist_reg.R (a sketch of such an analysis follows below)

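    A hedged sketch of what a script like Brazilian_logist_reg.R might do; the data frame brazil and its columns satisfaction (coded 0/1 or as a factor) and age are hypothetical stand-ins:

      fit.glm <- glm(satisfaction ~ age, family = binomial, data = brazil)
      summary(fit.glm)
      predict(fit.glm, data.frame(age = 40), type = "response")  # estimated P(high) at age 40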

  • 46/50

    Generalized linear models

    Logistic regression:
      Y \sim \mathrm{Binom}(1, \pi(x))
      E[Y|x] = \pi(x) = \frac{e^{\eta(x)}}{1 + e^{\eta(x)}}
      g(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)
      g(E[Y|x]) = \eta(x) = x^T\beta
    Special case of
      Y \sim f(\mu(x)),   f in the exponential family
      \eta(x) = g(\mu(x)) = x^T\beta
      E[Y|x] = \mu(x) = g^{-1}(\eta(x))
    Generalized linear models
      Include linear regression, logistic regression, Poisson regression, ...
      Topic of STK3100


  • 47/50

    K-nearest neighbor regression

    Linear regression:
      Simple to fit
      Simple interpretation
      Simple to perform different tests
      Strong assumptions on the model
    Non-parametric methods (machine learning):
      Do not assume any explicit form
    The K-nearest neighbor method:
      \hat f(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i
    where N_0 \subset \{x_1, ..., x_n\} contains the K points nearest to x_0.

    [Figure: KNN regression surfaces over (x1, x2) for a small and a large K]

    Choice of K: trade-off between bias and variance (a minimal sketch follows below).
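    A minimal sketch of K-nearest-neighbor regression in plain R (one-dimensional x and a simulated sine curve, for simplicity):

      knn.reg <- function(x, y, x0, K) {
        idx <- order(abs(x - x0))[1:K]           # indices of the K nearest points
        mean(y[idx])                             # average of their responses
      }
      set.seed(1)
      x <- runif(100); y <- sin(2*pi*x) + rnorm(100, sd = 0.2)
      knn.reg(x, y, x0 = 0.5, K = 5)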

  • 48/50

    Parametric or non-parametric?

    If the parametric model is close to the truth: the parametric method is better (smaller variance, small bias).

    [Figure: simulated data with linear and KNN fits, and mean squared error against 1/K compared with the linear fit]


  • 49/50

    Parametric or non-parametric?

    If the parametric model is very wrong: the non-parametric method is better (smaller bias).

    [Figure: two non-linear true regression functions with linear and KNN fits, and mean squared error against 1/K for each]


  • 50/50

    Parametric or non-parametric - high dimension

    For one explanatory variable: many observations can give good results for non-parametric methods.
    For many explanatory variables: the K nearest x_i's to x_0 will typically be far away from x_0. This gives large bias, since the method is based on f(x_0) \approx f(x_i) for x_i close to x_0.

    [Figure: mean squared error against 1/K for p = 1, 2, 3, 4, 10 and 20 explanatory variables]
