
1

Regression and the Bias-Variance Decomposition

William Cohen, 10-601, April 2008

Readings: Bishop 3.1, 3.2

2

Regression

• Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
  – Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
  – Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
  – …

3

Example: univariate linear regression

• Example: predict age from number of publications

[Scatter plot: Number of Publications (x-axis, 0–160) vs. Age in Years (y-axis, 0–50)]

4

Linear regression

• Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0,\sigma)$
• Training data: $(x_1,y_1),\ldots,(x_n,y_n)$
• Goal: estimate $a, b$ with $\hat{\mathbf{w}} = (\hat a, \hat b)$

$$
\begin{aligned}
\hat{\mathbf{w}} &= \arg\max_{\mathbf{w}} \Pr(\mathbf{w}\mid D)
  = \arg\max_{\mathbf{w}} \Pr(D\mid\mathbf{w})\,\Pr(\mathbf{w})
  && \text{(assume MLE: drop the prior)}\\
&= \arg\max_{\mathbf{w}} \log\Pr(D\mid\mathbf{w})
  = \arg\max_{\mathbf{w}} \sum_i \log\Pr(y_i\mid x_i,\mathbf{w})\\
&= \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2,
  \qquad \hat y_i(\mathbf{w}) = \hat y(x_i) = \hat a x_i + \hat b
\end{aligned}
$$

5

Linear regression

• Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0,\sigma)$
• Training data: $(x_1,y_1),\ldots,(x_n,y_n)$
• Goal: estimate $a, b$ with $\hat{\mathbf{w}} = (\hat a, \hat b) = \arg\min_{\mathbf{w}} \sum_i \hat\varepsilon_i^2$
• Ways to estimate the parameters:
  – Find the derivative w.r.t. the parameters a, b
  – Set it to zero and solve
  – Or use gradient ascent to solve
  – Or …
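For the gradient route above, a minimal sketch in Python (not from the slides; the synthetic data, step size, and iteration count are illustrative assumptions) that minimizes Σ_i ε̂_i² by gradient descent on (a, b), equivalent to gradient ascent on the Gaussian log-likelihood:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)   # y_i = a*x_i + b + eps_i, true (a, b) = (2, 1)

a, b = 0.0, 0.0
lr = 0.005                                         # illustrative step size
for _ in range(5000):
    resid = y - (a * x + b)                        # eps-hat_i
    grad_a = -2 * np.mean(resid * x)               # d/da of the mean squared residual
    grad_b = -2 * np.mean(resid)                   # d/db of the mean squared residual
    a, b = a - lr * grad_a, b - lr * grad_b

print(a, b)                                        # should approach the true (2.0, 1.0)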

6

Linear regression

[Figure: scatter plot with fitted line ŷ, marked points (x_1, y_1), (x_2, y_2), the means (x̄, ȳ), and residuals d_1, d_2, d_3]

How to estimate the slope?

$$
\text{slope} = \hat a \approx \frac{\Delta \hat y}{\Delta x}
\approx \frac{y_1 - \bar y}{x_1 - \bar x}
\approx \frac{y_2 - \bar y}{x_2 - \bar x} \approx \ldots
$$

With $\bar x = \frac{1}{n}\sum_i x_i$ and $\bar y = \frac{1}{n}\sum_i y_i$:

$$
\hat a = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}
       = \frac{n \cdot \mathrm{cov}(X,Y)}{n \cdot \mathrm{var}(X)}
$$

7

Linear regression

How to estimate the intercept?

[Figure: same scatter plot with fitted line $\hat y = \hat a x + \hat b$ and residuals d_1, d_2, d_3]

$$
\hat b = \bar y - \hat a \bar x
$$
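A minimal sketch in Python of the closed-form estimates above, $\hat a = \mathrm{cov}(X,Y)/\mathrm{var}(X)$ and $\hat b = \bar y - \hat a \bar x$ (the synthetic data are an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)   # true slope 2.0, intercept 1.0

x_bar, y_bar = x.mean(), y.mean()
a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # n*cov / (n*var)
b_hat = y_bar - a_hat * x_bar
print(a_hat, b_hat)                                # close to (2.0, 1.0)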

8

Bias/Variance Decomposition of Error

9

Bias – Variance decomposition of error

• Return to the simple regression problem f: X → Y,
    y = f(x) + ε
  where f is deterministic and the noise ε ~ N(0, σ).

• What is the expected error for a learned h?

10

Bias – Variance decomposition of error

$$
E_{D}\Big[\int \big[f(x) - h_D(x)\big]^2 \,\Pr(x)\, dx\Big]
= \int\!\!\int \big[f(x) - h_D(x)\big]^2 \,\Pr(x)\,\Pr(D)\; dx\, dD
$$
(f: the true function; h_D: the hypothesis learned from dataset D)

Experiment (the error of which I'd like to predict):

1. Draw a size-n sample D = (x_1,y_1), …, (x_n,y_n)
2. Train a linear function h_D using D
3. Draw a test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

11

Bias – Variance decomposition of error (2)

$$
E_{D}\Big[\big(f(x) - h_D(x)\big)^2\Big]
$$
(f: the true function; h_D: the hypothesis learned from dataset D)

Fix x, then do this experiment:

1. Draw a size-n sample D = (x_1,y_1), …, (x_n,y_n)
2. Train a linear function h_D using D
3. Draw the test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

12

Bias – Variance decomposition of error

Notation: t = f(x) + ε is the observed test value, f = f(x), and ŷ = h_D(x) is the learned prediction (really ŷ_D).

$$
\begin{aligned}
E\big[(t - \hat y)^2\big]
&= E\big[(t - f)^2 + 2(t - f)(f - \hat y) + (f - \hat y)^2\big]\\
&= E\big[(t - f)^2\big] + E\big[(f - \hat y)^2\big]
   + 2\,E\big[ft - t\hat y - f^2 + f\hat y\big]
\end{aligned}
$$

(Why not decompose $E_{D}[(f(x) - h_D(x))^2]$ instead? Because the test example we actually see is $t = f(x)+\varepsilon$, not the noiseless $f(x)$.)

13

Bias – Variance decomposition of error

$$
\begin{aligned}
E_{D,t}\big[(t - \hat y)^2\big]
&= E\big[(t - f)^2\big] + E\big[(f - \hat y)^2\big]
   + 2\big(E[ft] - E[t]\,E[\hat y] - E[f^2] + E[f]\,E[\hat y]\big)\\
&= E\big[(t - f)^2\big] + E\big[(f - \hat y)^2\big]
\end{aligned}
$$

The cross term vanishes because $E[t] = E[f] = f(x)$ (the test noise has mean zero) and the test noise is independent of $\hat y$.

$E[(t - f)^2]$: intrinsic noise.
$E[(f - \hat y)^2]$: depends on how well the learner approximates f.

14

Bias – Variance decomposition of error

Define $\bar h \equiv E_D\{h_D(x)\}$ and $\hat y \equiv \hat y_D = h_D(x)$, and decompose the second term the same way:

$$
\begin{aligned}
E\big[(f - \hat y)^2\big]
&= E\big[(f - \bar h)^2 + 2(f - \bar h)(\bar h - \hat y) + (\bar h - \hat y)^2\big]\\
&= E\big[(f - \bar h)^2\big] + E\big[(\bar h - \hat y)^2\big]
   + 2\big(E[f\bar h] - E[f\hat y] - E[\bar h^2] + E[\bar h]\,E[\hat y]\big)\\
&= \big(f(x) - E_D[h_D(x)]\big)^2 + E_D\Big[\big(E_D[h_D(x)] - h_D(x)\big)^2\Big]\\
&= \mathrm{BIAS}^2 + \mathrm{VARIANCE}
\end{aligned}
$$

(the cross term vanishes because $E_D[\hat y] = \bar h$, while $f$ and $\bar h$ are constants for the fixed x)

BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, E_D[h_D(x)].

VARIANCE: the squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (ŷ).
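A minimal sketch in Python of the fixed-x experiment on the previous slides, using a univariate least-squares learner; the true function, noise level, sample size, and test point are illustrative assumptions. It estimates the noise, BIAS², and VARIANCE terms at a single x and checks that they add up to the measured expected squared error:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                       # true (deterministic) function
sigma, n, x0 = 0.3, 20, 2.0                   # noise level, |D|, the fixed test point x

def train(D_x, D_y):
    # least-squares fit y ~ a*x + b; returns the prediction function h_D
    a, b = np.polyfit(D_x, D_y, deg=1)
    return lambda x: a * x + b

preds, errors = [], []
for _ in range(5000):                         # repeat the experiment many times
    D_x = rng.uniform(0, 4, size=n)           # 1. draw a size-n sample D
    D_y = f(D_x) + rng.normal(0, sigma, size=n)
    h_D = train(D_x, D_y)                     # 2. train h_D using D
    t = f(x0) + rng.normal(0, sigma)          # 3. draw the test example (x0, f(x0)+eps)
    preds.append(h_D(x0))
    errors.append((t - h_D(x0)) ** 2)         # 4. squared error of h_D on that example

preds = np.array(preds)
noise = sigma ** 2                            # E[(t - f)^2]
bias2 = (f(x0) - preds.mean()) ** 2           # (f(x0) - E_D[h_D(x0)])^2
variance = preds.var()                        # E_D[(h_D(x0) - E_D[h_D(x0)])^2]
print(np.mean(errors), noise + bias2 + variance)   # the two numbers should roughly agree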

15

Bias-variance decomposition

How can you reduce bias of a learner?
– Make the long-term average better approximate the true function f(x).

How can you reduce variance of a learner?
– Make the learner less sensitive to variations in the data.

16

A generalization of bias-variance decomposition to other loss functions

• "Arbitrary" real-valued loss L(t,y), with L(y,y') = L(y',y), L(y,y) = 0, and L(y,y') ≠ 0 if y ≠ y'
• Define the "optimal prediction": y* = argmin_{y'} E_t[L(t,y')]
• Define the "main prediction of the learner": y_m = argmin_{y'} E_D[L(y,y')], where y = h_D(x) is the learner's prediction
• Define the "bias of the learner": B(x) = L(y*, y_m)
• Define the "variance of the learner": V(x) = E_D[L(y_m, y)]
• Define the "noise for x": N(x) = E_t[L(t, y*)]

Claim:

E_{D,t}[L(t,y)] = c_1 N(x) + B(x) + c_2 V(x)

where
  c_1 = 2 Pr_D[y = y*] − 1
  c_2 = 1 if y_m = y*, −1 otherwise
(with m = n = |D|)

17

Other regression methods

18

Example: univariate linear regression

• Example: predict age from number of publications

[Scatter plot: Number of Publications (x-axis, 0–160) vs. Age in Years (y-axis, 0–50), with the fitted regression line]

Paul Erdős, Hungarian mathematician, 1913-1996: with x ~ 1500 publications, the fitted line predicts an age of about 240.

19

Linear regression

[Figure: scatter plot with fitted line and residuals d_1, d_2, d_3]

Summary:
$$
\hat a = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2},
\qquad
\hat b = \bar y - \hat a \bar x
$$

To simplify:
• assume zero-centered data, as we did for PCA
• let x = (x_1, …, x_n) and y = (y_1, …, y_n)
• then…
$$
\hat a = (\mathbf{x}^T \mathbf{y})(\mathbf{x}^T \mathbf{x})^{-1},
\qquad
\hat b = 0
$$

20

Onward: multivariate linear regression

Univariate:
$$
\mathbf{x} = (x_1, \ldots, x_n), \quad \mathbf{y} = (y_1, \ldots, y_n), \qquad
\hat w = (\mathbf{x}^T \mathbf{y})(\mathbf{x}^T \mathbf{x})^{-1}
$$

Multivariate:
$$
\hat y = \hat w_1 x^1 + \ldots + \hat w_k x^k, \qquad
X = \begin{pmatrix} x_1^1 & \ldots & x_1^k \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^k \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
$$
(row is an example, column is a feature)

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2,
\qquad \hat y_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i,
\qquad
\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}
$$
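A minimal sketch in Python of the multivariate estimate ŵ = (XᵀX)⁻¹Xᵀy on synthetic data (the data, shapes, and true weights are illustrative assumptions); solving the normal equations with np.linalg.solve rather than explicitly forming the inverse is the usual, numerically safer choice:

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3                                   # n examples (rows), k features (columns)
X = rng.normal(size=(n, k))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)       # solves (X^T X) w = X^T y
print(w_hat)                                    # close to w_true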

21

Onward: multivariate linear regression

Multivariate, regularized:
$$
X = \begin{pmatrix} x_1^1 & \ldots & x_1^k \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^k \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\hat y_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i
$$
$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2 + \lambda\, \mathbf{w}^T \mathbf{w}
\qquad\Longrightarrow\qquad
\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

22

Onward: multivariate linear regression

Multivariate, multiple outputs:
$$
X = \begin{pmatrix} x_1^1 & \ldots & x_1^m \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^m \end{pmatrix}, \quad
Y = \begin{pmatrix} y_1^1 & \ldots & y_1^k \\ \vdots & & \vdots \\ y_n^1 & \ldots & y_n^k \end{pmatrix}
$$
$$
\hat{\mathbf{y}} = \hat W^T \mathbf{x}, \qquad
\hat W = (X^T X)^{-1} X^T Y
$$

23

Onward: multivariate linear regression

Multivariate, regularized:
$$
X = \begin{pmatrix} x_1^1 & \ldots & x_1^k \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^k \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\hat y_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i
$$
$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2 + \lambda\, \mathbf{w}^T \mathbf{w}
= (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

What does increasing λ do?
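A minimal sketch in Python of the regularized estimate above, probing the question: as λ increases, the estimated weights shrink toward zero, trading variance for bias (the data and the λ values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=n)

for lam in [0.0, 1.0, 10.0, 100.0]:
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)   # (X^T X + lam*I)^{-1} X^T y
    print(lam, np.round(w_hat, 3), "||w||^2 =", round(float(w_hat @ w_hat), 3))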

24

Onward: multivariate linear regression

Univariate case written as a two-feature problem (a constant feature plus x):
$$
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\hat y_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i
$$
$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2 + \lambda\, \mathbf{w}^T \mathbf{w}
= (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

w = (w_1, w_2). What does fixing w_2 = 0 do (if λ = 0)?

25

Regression trees - summary [Quinlan's M5]

• Growing the tree:
  – Split to optimize information gain
• At each leaf node:
  – Predict the majority class → for regression: build a linear model, then greedily remove features
• Pruning the tree:
  – Prune to reduce error on holdout → for regression: use estimated error on the training data, with estimates adjusted by (n+k)/(n−k), where n = #cases and k = #features
• Prediction:
  – Trace a path to a leaf and predict the associated majority class → for regression: use a linear interpolation of every prediction made by every node on the path

26

Regression trees: example - 1

27

Regression trees – example 2

What does pruning do to bias and variance?
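To probe this question empirically, a minimal sketch in Python, using not Quinlan's M5 but scikit-learn's standard regression tree, where pruning strength is controlled by the cost-complexity parameter ccp_alpha; the dataset, the parameter values, and the use of sklearn are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                       # illustrative "true" function
x_test = np.linspace(0, 2, 200).reshape(-1, 1)

def predictions(ccp_alpha, n_datasets=200, n=30, sigma=0.3):
    # train one tree per dataset D; return the matrix of predictions on x_test
    preds = []
    for _ in range(n_datasets):
        X = rng.uniform(0, 2, size=(n, 1))
        y = f(X.ravel()) + rng.normal(0, sigma, size=n)
        tree = DecisionTreeRegressor(ccp_alpha=ccp_alpha).fit(X, y)
        preds.append(tree.predict(x_test))
    return np.array(preds)                        # shape (n_datasets, len(x_test))

for alpha in [0.0, 0.01, 0.1]:                    # 0.0 = no cost-complexity pruning
    P = predictions(alpha)
    bias2 = np.mean((P.mean(axis=0) - f(x_test.ravel())) ** 2)
    var = np.mean(P.var(axis=0))
    print(f"ccp_alpha={alpha}: bias^2={bias2:.3f}, variance={var:.3f}")

With these illustrative settings, heavier pruning typically lowers the variance term while raising bias², which is the usual answer to the question above.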

28

Kernel regression

• aka locally weighted regression, locally linear regression, LOESS, …

29

Kernel regression

• aka locally weighted regression, locally linear regression, …
• Close approximation to kernel regression (see the sketch below):
  – Pick a few values z_1, …, z_k up front
  – Preprocess: for each example (x, y), replace x with x' = ⟨K(x,z_1), …, K(x,z_k)⟩, where K(x,z) = exp( −(x−z)² / 2σ² )
  – Use multivariate regression on the (x', y) pairs
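A minimal sketch in Python of the approximation just described: replace each x by its kernel-feature vector and run (lightly regularized) multivariate linear regression; the centers, the kernel width σ, and the small ridge term are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                      # illustrative target function
x = rng.uniform(0, 2, size=100)
y = f(x) + rng.normal(0, 0.1, size=100)

z = np.linspace(0, 2, 10)                        # z_1, ..., z_k picked up front
sigma, lam = 0.3, 1e-3

def features(x):
    # each row is <K(x, z_1), ..., K(x, z_k)> for one example x
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * sigma ** 2))

Phi = features(x)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(z)), Phi.T @ y)   # multivariate regression

x_test = np.linspace(0, 2, 5)
print(features(x_test) @ w)                      # predictions, approximately f(x_test)
print(f(x_test))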

30

Kernel regression

• aka locally weighted regression, locally linear regression, LOESS, …

What does making the kernel wider do to bias and variance?

31

Additional readings

• P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.
• J. R. Quinlan, Learning with Continuous Classes. 5th Australian Joint Conference on Artificial Intelligence, 1992.
• Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes. 9th European Conference on Machine Learning, 1997.
• D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models. JAIR, 1996. (Recommended)