
1

Regression and the Bias-Variance Decomposition

William Cohen, 10-601, April 2008

Readings: Bishop 3.1, 3.2

2

Regression

• Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
  – Replace livesInSquirrelHill(x1,x2,…,xn) with averageCommuteDistanceInMiles(x1,x2,…,xn)
  – Replace userLikesMovie(u,m) with usersRatingForMovie(u,m)
  – …

3

Example: univariate linear regression

• Example: predict age from number of publications

[Scatter plot: Number of Publications (x-axis, 0–160) vs. Age in Years (y-axis, 0–50)]

4

Linear regression

• Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0,\sigma)$
• Training data: $(x_1,y_1),\ldots,(x_n,y_n)$
• Goal: estimate $a, b$ with $\hat{\mathbf{w}} = (\hat a, \hat b)$

$$
\begin{aligned}
\hat{\mathbf{w}} &= \arg\max_{\mathbf{w}} \Pr(\mathbf{w}\mid D)
  = \arg\max_{\mathbf{w}} \Pr(D\mid\mathbf{w})\,\Pr(\mathbf{w})
  && \text{(assume MLE: drop the prior)}\\
&= \arg\max_{\mathbf{w}} \log\Pr(D\mid\mathbf{w})
  = \arg\max_{\mathbf{w}} \sum_i \log\Pr(y_i\mid x_i,\mathbf{w})\\
&= \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2,
  \qquad \hat y_i(\mathbf{w}) = \hat y(x_i) = \hat a x_i + \hat b
\end{aligned}
$$

5

Linear regression

• Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0,\sigma)$
• Training data: $(x_1,y_1),\ldots,(x_n,y_n)$
• Goal: estimate $a, b$ with $\hat{\mathbf{w}} = (\hat a, \hat b) = \arg\min_{\mathbf{w}} \sum_i \hat\varepsilon_i^2$
• Ways to estimate the parameters:
  – Find the derivative w.r.t. the parameters a, b
  – Set it to zero and solve
  – Or use gradient ascent to solve
  – Or …
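For the gradient route above, a minimal sketch in Python (not from the slides; the synthetic data, step size, and iteration count are illustrative assumptions) that minimizes Σ_i ε̂_i² by gradient descent on (a, b), equivalent to gradient ascent on the Gaussian log-likelihood:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)   # y_i = a*x_i + b + eps_i, true (a, b) = (2, 1)

a, b = 0.0, 0.0
lr = 0.005                                         # illustrative step size
for _ in range(5000):
    resid = y - (a * x + b)                        # eps-hat_i
    grad_a = -2 * np.mean(resid * x)               # d/da of the mean squared residual
    grad_b = -2 * np.mean(resid)                   # d/db of the mean squared residual
    a, b = a - lr * grad_a, b - lr * grad_b

print(a, b)                                        # should approach the true (2.0, 1.0)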

6

Linear regression

[Figure: scatter plot with fitted line ŷ, marked points (x_1, y_1), (x_2, y_2), the means (x̄, ȳ), and residuals d_1, d_2, d_3]

How to estimate the slope?

$$
\text{slope} = \hat a \approx \frac{\Delta \hat y}{\Delta x}
\approx \frac{y_1 - \bar y}{x_1 - \bar x}
\approx \frac{y_2 - \bar y}{x_2 - \bar x} \approx \ldots
$$

With $\bar x = \frac{1}{n}\sum_i x_i$ and $\bar y = \frac{1}{n}\sum_i y_i$:

$$
\hat a = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}
       = \frac{n \cdot \mathrm{cov}(X,Y)}{n \cdot \mathrm{var}(X)}
$$

7

Linear regression

How to estimate the intercept?

[Figure: same scatter plot with fitted line $\hat y = \hat a x + \hat b$ and residuals d_1, d_2, d_3]

$$
\hat b = \bar y - \hat a \bar x
$$
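A minimal sketch in Python of the closed-form estimates above, $\hat a = \mathrm{cov}(X,Y)/\mathrm{var}(X)$ and $\hat b = \bar y - \hat a \bar x$ (the synthetic data are an illustrative assumption):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=100)   # true slope 2.0, intercept 1.0

x_bar, y_bar = x.mean(), y.mean()
a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)   # n*cov / (n*var)
b_hat = y_bar - a_hat * x_bar
print(a_hat, b_hat)                                # close to (2.0, 1.0)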

8

Bias/Variance Decomposition of Error

9

Bias – Variance decomposition of error

• Return to the simple regression problem f: X → Y,
    y = f(x) + ε
  where f is deterministic and the noise ε ~ N(0, σ).

• What is the expected error for a learned h?

10

Bias – Variance decomposition of error

$$
E_{D}\Big[\int \big[f(x) - h_D(x)\big]^2 \,\Pr(x)\, dx\Big]
= \int\!\!\int \big[f(x) - h_D(x)\big]^2 \,\Pr(x)\,\Pr(D)\; dx\, dD
$$
(f: the true function; h_D: the hypothesis learned from dataset D)

Experiment (the error of which I'd like to predict):

1. Draw a size-n sample D = (x_1,y_1), …, (x_n,y_n)
2. Train a linear function h_D using D
3. Draw a test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

11

Bias – Variance decomposition of error (2)

$$
E_{D}\Big[\big(f(x) - h_D(x)\big)^2\Big]
$$
(f: the true function; h_D: the hypothesis learned from dataset D)

Fix x, then do this experiment:

1. Draw a size-n sample D = (x_1,y_1), …, (x_n,y_n)
2. Train a linear function h_D using D
3. Draw the test example (x, f(x)+ε)
4. Measure the squared error of h_D on that example

12

Bias – Variance decomposition of error

Notation: t = f(x) + ε is the observed test value, f = f(x), and ŷ = h_D(x) is the learned prediction (really ŷ_D).

$$
\begin{aligned}
E\big[(t - \hat y)^2\big]
&= E\big[(t - f)^2 + 2(t - f)(f - \hat y) + (f - \hat y)^2\big]\\
&= E\big[(t - f)^2\big] + E\big[(f - \hat y)^2\big]
   + 2\,E\big[ft - t\hat y - f^2 + f\hat y\big]
\end{aligned}
$$

(Why not decompose $E_{D}[(f(x) - h_D(x))^2]$ instead? Because the test example we actually see is $t = f(x)+\varepsilon$, not the noiseless $f(x)$.)

13

Bias – Variance decomposition of error

$$
\begin{aligned}
E_{D,t}\big[(t - \hat y)^2\big]
&= E\big[(t - f)^2\big] + E\big[(f - \hat y)^2\big]
   + 2\big(E[ft] - E[t]\,E[\hat y] - E[f^2] + E[f]\,E[\hat y]\big)\\
&= E\big[(t - f)^2\big] + E\big[(f - \hat y)^2\big]
\end{aligned}
$$

The cross term vanishes because $E[t] = E[f] = f(x)$ (the test noise has mean zero) and the test noise is independent of $\hat y$.

$E[(t - f)^2]$: intrinsic noise.
$E[(f - \hat y)^2]$: depends on how well the learner approximates f.

14

Bias – Variance decomposition of error

Define $\bar h \equiv E_D\{h_D(x)\}$ and $\hat y \equiv \hat y_D = h_D(x)$, and decompose the second term the same way:

$$
\begin{aligned}
E\big[(f - \hat y)^2\big]
&= E\big[(f - \bar h)^2 + 2(f - \bar h)(\bar h - \hat y) + (\bar h - \hat y)^2\big]\\
&= E\big[(f - \bar h)^2\big] + E\big[(\bar h - \hat y)^2\big]
   + 2\big(E[f\bar h] - E[f\hat y] - E[\bar h^2] + E[\bar h]\,E[\hat y]\big)\\
&= \big(f(x) - E_D[h_D(x)]\big)^2 + E_D\Big[\big(E_D[h_D(x)] - h_D(x)\big)^2\Big]\\
&= \mathrm{BIAS}^2 + \mathrm{VARIANCE}
\end{aligned}
$$

(the cross term vanishes because $E_D[\hat y] = \bar h$, while $f$ and $\bar h$ are constants for the fixed x)

BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if we averaged over many datasets D, E_D[h_D(x)].

VARIANCE: the squared difference between our long-term expectation for the learner's performance, E_D[h_D(x)], and what we expect in a representative run on a dataset D (ŷ).
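A minimal sketch in Python of the fixed-x experiment on the previous slides, using a univariate least-squares learner; the true function, noise level, sample size, and test point are illustrative assumptions. It estimates the noise, BIAS², and VARIANCE terms at a single x and checks that they add up to the measured expected squared error:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                       # true (deterministic) function
sigma, n, x0 = 0.3, 20, 2.0                   # noise level, |D|, the fixed test point x

def train(D_x, D_y):
    # least-squares fit y ~ a*x + b; returns the prediction function h_D
    a, b = np.polyfit(D_x, D_y, deg=1)
    return lambda x: a * x + b

preds, errors = [], []
for _ in range(5000):                         # repeat the experiment many times
    D_x = rng.uniform(0, 4, size=n)           # 1. draw a size-n sample D
    D_y = f(D_x) + rng.normal(0, sigma, size=n)
    h_D = train(D_x, D_y)                     # 2. train h_D using D
    t = f(x0) + rng.normal(0, sigma)          # 3. draw the test example (x0, f(x0)+eps)
    preds.append(h_D(x0))
    errors.append((t - h_D(x0)) ** 2)         # 4. squared error of h_D on that example

preds = np.array(preds)
noise = sigma ** 2                            # E[(t - f)^2]
bias2 = (f(x0) - preds.mean()) ** 2           # (f(x0) - E_D[h_D(x0)])^2
variance = preds.var()                        # E_D[(h_D(x0) - E_D[h_D(x0)])^2]
print(np.mean(errors), noise + bias2 + variance)   # the two numbers should roughly agree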

15

Bias-variance decomposition

How can you reduce bias of a learner?
– Make the long-term average better approximate the true function f(x).

How can you reduce variance of a learner?
– Make the learner less sensitive to variations in the data.

16

A generalization of bias-variance decomposition to other loss functions

• "Arbitrary" real-valued loss L(t,y), with L(y,y') = L(y',y), L(y,y) = 0, and L(y,y') ≠ 0 if y ≠ y'
• Define the "optimal prediction": y* = argmin_{y'} E_t[L(t,y')]
• Define the "main prediction of the learner": y_m = argmin_{y'} E_D[L(y,y')], where y = h_D(x) is the learner's prediction
• Define the "bias of the learner": B(x) = L(y*, y_m)
• Define the "variance of the learner": V(x) = E_D[L(y_m, y)]
• Define the "noise for x": N(x) = E_t[L(t, y*)]

Claim:

E_{D,t}[L(t,y)] = c_1 N(x) + B(x) + c_2 V(x)

where
  c_1 = 2 Pr_D[y = y*] − 1
  c_2 = 1 if y_m = y*, −1 otherwise
(with m = n = |D|)

17

Other regression methods

18

Example: univariate linear regression

• Example: predict age from number of publications

[Scatter plot: Number of Publications (x-axis, 0–160) vs. Age in Years (y-axis, 0–50), with the fitted regression line]

Paul Erdős, Hungarian mathematician, 1913-1996: with x ~ 1500 publications, the fitted line predicts an age of about 240.

19

Linear regression

[Figure: scatter plot with fitted line and residuals d_1, d_2, d_3]

Summary:
$$
\hat a = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2},
\qquad
\hat b = \bar y - \hat a \bar x
$$

To simplify:
• assume zero-centered data, as we did for PCA
• let x = (x_1, …, x_n) and y = (y_1, …, y_n)
• then…
$$
\hat a = (\mathbf{x}^T \mathbf{y})(\mathbf{x}^T \mathbf{x})^{-1},
\qquad
\hat b = 0
$$

20

Onward: multivariate linear regression

Univariate:
$$
\mathbf{x} = (x_1, \ldots, x_n), \quad \mathbf{y} = (y_1, \ldots, y_n), \qquad
\hat w = (\mathbf{x}^T \mathbf{y})(\mathbf{x}^T \mathbf{x})^{-1}
$$

Multivariate:
$$
\hat y = \hat w_1 x^1 + \ldots + \hat w_k x^k, \qquad
X = \begin{pmatrix} x_1^1 & \ldots & x_1^k \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^k \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
$$
(row is an example, column is a feature)

$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2,
\qquad \hat y_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i,
\qquad
\hat{\mathbf{w}} = (X^T X)^{-1} X^T \mathbf{y}
$$
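A minimal sketch in Python of the multivariate estimate ŵ = (XᵀX)⁻¹Xᵀy on synthetic data (the data, shapes, and true weights are illustrative assumptions); solving the normal equations with np.linalg.solve rather than explicitly forming the inverse is the usual, numerically safer choice:

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3                                   # n examples (rows), k features (columns)
X = rng.normal(size=(n, k))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(0, 0.1, size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)       # solves (X^T X) w = X^T y
print(w_hat)                                    # close to w_true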

21

Onward: multivariate linear regression

Multivariate, regularized:
$$
X = \begin{pmatrix} x_1^1 & \ldots & x_1^k \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^k \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\hat y_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i
$$
$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2 + \lambda\, \mathbf{w}^T \mathbf{w}
\qquad\Longrightarrow\qquad
\hat{\mathbf{w}} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

22

Onward: multivariate linear regression

Multivariate, multiple outputs:
$$
X = \begin{pmatrix} x_1^1 & \ldots & x_1^m \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^m \end{pmatrix}, \quad
Y = \begin{pmatrix} y_1^1 & \ldots & y_1^k \\ \vdots & & \vdots \\ y_n^1 & \ldots & y_n^k \end{pmatrix}
$$
$$
\hat{\mathbf{y}} = \hat W^T \mathbf{x}, \qquad
\hat W = (X^T X)^{-1} X^T Y
$$

23

Onward: multivariate linear regression

Multivariate, regularized:
$$
X = \begin{pmatrix} x_1^1 & \ldots & x_1^k \\ \vdots & & \vdots \\ x_n^1 & \ldots & x_n^k \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\hat y_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i
$$
$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2 + \lambda\, \mathbf{w}^T \mathbf{w}
= (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

What does increasing λ do?
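A minimal sketch in Python of the regularized estimate above, probing the question: as λ increases, the estimated weights shrink toward zero, trading variance for bias (the data and the λ values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=n)

for lam in [0.0, 1.0, 10.0, 100.0]:
    w_hat = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)   # (X^T X + lam*I)^{-1} X^T y
    print(lam, np.round(w_hat, 3), "||w||^2 =", round(float(w_hat @ w_hat), 3))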

24

Onward: multivariate linear regression

Univariate case written as a two-feature problem (a constant feature plus x):
$$
X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\hat y_i(\mathbf{w}) = \mathbf{w}^T \mathbf{x}_i
$$
$$
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_i \big[y_i - \hat y_i(\mathbf{w})\big]^2 + \lambda\, \mathbf{w}^T \mathbf{w}
= (X^T X + \lambda I)^{-1} X^T \mathbf{y}
$$

w = (w_1, w_2). What does fixing w_2 = 0 do (if λ = 0)?

25

Regression trees - summary [Quinlan's M5]

• Growing the tree:
  – Split to optimize information gain
• At each leaf node:
  – Predict the majority class → for regression: build a linear model, then greedily remove features
• Pruning the tree:
  – Prune to reduce error on holdout → for regression: use estimated error on the training data, with estimates adjusted by (n+k)/(n−k), where n = #cases and k = #features
• Prediction:
  – Trace a path to a leaf and predict the associated majority class → for regression: use a linear interpolation of every prediction made by every node on the path

26

Regression trees: example - 1

27

Regression trees – example 2

What does pruning do to bias and variance?
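To probe this question empirically, a minimal sketch in Python, using not Quinlan's M5 but scikit-learn's standard regression tree, where pruning strength is controlled by the cost-complexity parameter ccp_alpha; the dataset, the parameter values, and the use of sklearn are illustrative assumptions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                       # illustrative "true" function
x_test = np.linspace(0, 2, 200).reshape(-1, 1)

def predictions(ccp_alpha, n_datasets=200, n=30, sigma=0.3):
    # train one tree per dataset D; return the matrix of predictions on x_test
    preds = []
    for _ in range(n_datasets):
        X = rng.uniform(0, 2, size=(n, 1))
        y = f(X.ravel()) + rng.normal(0, sigma, size=n)
        tree = DecisionTreeRegressor(ccp_alpha=ccp_alpha).fit(X, y)
        preds.append(tree.predict(x_test))
    return np.array(preds)                        # shape (n_datasets, len(x_test))

for alpha in [0.0, 0.01, 0.1]:                    # 0.0 = no cost-complexity pruning
    P = predictions(alpha)
    bias2 = np.mean((P.mean(axis=0) - f(x_test.ravel())) ** 2)
    var = np.mean(P.var(axis=0))
    print(f"ccp_alpha={alpha}: bias^2={bias2:.3f}, variance={var:.3f}")

With these illustrative settings, heavier pruning typically lowers the variance term while raising bias², which is the usual answer to the question above.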

28

Kernel regression

• aka locally weighted regression, locally linear regression, LOESS, …

29

Kernel regression

• aka locally weighted regression, locally linear regression, …
• Close approximation to kernel regression (see the sketch below):
  – Pick a few values z_1, …, z_k up front
  – Preprocess: for each example (x, y), replace x with x' = ⟨K(x,z_1), …, K(x,z_k)⟩, where K(x,z) = exp( −(x−z)² / 2σ² )
  – Use multivariate regression on the (x', y) pairs
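A minimal sketch in Python of the approximation just described: replace each x by its kernel-feature vector and run (lightly regularized) multivariate linear regression; the centers, the kernel width σ, and the small ridge term are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                      # illustrative target function
x = rng.uniform(0, 2, size=100)
y = f(x) + rng.normal(0, 0.1, size=100)

z = np.linspace(0, 2, 10)                        # z_1, ..., z_k picked up front
sigma, lam = 0.3, 1e-3

def features(x):
    # each row is <K(x, z_1), ..., K(x, z_k)> for one example x
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2 * sigma ** 2))

Phi = features(x)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(z)), Phi.T @ y)   # multivariate regression

x_test = np.linspace(0, 2, 5)
print(features(x_test) @ w)                      # predictions, approximately f(x_test)
print(f(x_test))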

30

Kernel regression

• aka locally weighted regression, locally linear regression, LOESS, …

What does making the kernel wider do to bias and variance?

31

Additional readings

• P. Domingos, A Unified Bias-Variance Decomposition and its Applications. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 231-238), 2000. Stanford, CA: Morgan Kaufmann.
• J. R. Quinlan, Learning with Continuous Classes. 5th Australian Joint Conference on Artificial Intelligence, 1992.
• Y. Wang & I. Witten, Inducing Model Trees for Continuous Classes. 9th European Conference on Machine Learning, 1997.
• D. A. Cohn, Z. Ghahramani, & M. Jordan, Active Learning with Statistical Models. JAIR, 1996. (Recommended)