Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group, CSIRO Data61
and
College of Engineering and Computer Science, The Australian National University
Canberra. Semester One, 2020.
(Many figures from C. M. Bishop, "Pattern Recognition and Machine Learning")
Part III
Linear Regression 1
Linear Regression
N = 10
x ≡ (x1, . . . , xN)T
t ≡ (t1, . . . , tN)T
xi ∈ R, i = 1, . . . , N
ti ∈ R, i = 1, . . . , N

[Figure: scatter plot of the N = 10 training points (x, t).]

Predictor y(x,w)?
Performance measure?
Optimal solution w∗?
Recall: projection, inverse
Probabilities, Losses
Gaussian Distribution
Bayes Rule
Expected Loss
Linear Curve Fitting - Least Squares
N = 10
x ≡ (x1, . . . , xN)T
t ≡ (t1, . . . , tN)T
xi ∈ R, i = 1, . . . , N
ti ∈ R, i = 1, . . . , N

y(x,w) = w1x + w0

X ≡ [x 1]

w∗ = (XTX)−1XT t

[Figure: the N = 10 training points (x, t) with the fitted straight line.]

We assume

t = y(x,w) + ε

where y(x,w) is deterministic and ε is Gaussian noise.
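As a minimal NumPy sketch of this normal-equations solution (the data values below are made up for illustration; lstsq is used rather than forming the matrix inverse explicitly):

    import numpy as np

    # Toy data for illustration (not the lecture's data set), N = 10
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 10.0, 10)
    t = 0.7 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

    # Design matrix X = [x 1]; least-squares solution w* = (X^T X)^{-1} X^T t
    X = np.column_stack([x, np.ones_like(x)])
    w_star, *_ = np.linalg.lstsq(X, t, rcond=None)  # solves the normal equations stably

    w1, w0 = w_star                                 # fitted line y(x, w) = w1 * x + w0
    print(w1, w0)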
Curve fitting - revisited
a priori belief about the parameter w captured in the prior probability p(w)
observed data D = {t1, . . . , tN}
calculate the belief in w after the data D have been observed

p(w | D) = p(D |w) p(w) / p(D)

p(D |w) as a function of w : the likelihood function
the likelihood expresses how probable the data are for different values of w; it is not a probability density with respect to w (but it is with respect to D; prove it)
Maximum Likelihood
Consider the linear regression problem, where we have random variables xn and tn.
We assume a conditional model tn | xn.
We propose a distribution, parameterised by θ,

tn | xn ∼ density(θ)

For a given θ the density defines the probability of observing tn | xn.
We are interested in finding the θ that maximises the probability (called the likelihood) of the data.
Likelihood Function - Frequentist versus Bayesian
Likelihood function p(D |w)
Frequentist Approach
w considered a fixed parameter
value defined by some 'estimator'
error bars on the estimated w obtained from the distribution of possible data sets D

Bayesian Approach
only one single data set D
uncertainty in the parameters comes from a probability distribution over w
Frequentist Estimator - Maximum Likelihood
choose w for which the likelihood p(D |w) (the probability of the observed data) is maximal
the most common heuristic for learning a single fixed w
equivalently: the error function is the negative log of the likelihood function, to be minimised
log is a monotonic function
maximising the likelihood ⇐⇒ minimising the error
Example: a fair-looking coin is tossed three times, always landing on heads.
The maximum likelihood estimate of the probability of landing heads will be 1.
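As a one-line check of the coin example (a worked calculation, not from the slides): with head probability p the likelihood of three heads is p³, which is non-decreasing on [0, 1], so its maximiser sits at the boundary.

    L(p) = p^3, \qquad \frac{\mathrm{d}L}{\mathrm{d}p} = 3p^2 \ge 0 \ \text{on } [0,1]
    \;\Longrightarrow\; \hat{p}_{\mathrm{ML}} = \arg\max_{p \in [0,1]} p^3 = 1 .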
Bayesian Approach
including prior knowledge is easy (via a prior over w)
the choice of prior is subjective; it allows better results by incorporating domain knowledge
sometimes the choice of prior is motivated by a convenient mathematical form
the prior becomes irrelevant as N → ∞, but helps for small N
need to sum/integrate over the whole parameter space
advances in sampling (Markov Chain Monte Carlo methods)
advances in approximation schemes (Variational Bayes, Expectation Propagation)
there is no true w
Regression
Given a training data set of N observations {xn} and target values tn.
Goal: learn to predict the value of one or more target values t given a new value of the input x.
Example: polynomial curve fitting (see Introduction).

[Figure: training points (x, t) and a new input x at which the target t is unknown ("?").]
Supervised Learning: (non-Bayesian) Point Estimate

Training phase: a model with adjustable parameter w is fit to the training data x and the training targets t; fix the most appropriate w*.
Test phase: the model with fixed parameter w* maps test data x to predicted test targets t.
Why Linear Regression?
Analytic solution when minimising the sum of squared errors
Well understood statistical behaviour
Efficient algorithms exist for convex losses and regularizers
But what if the relationship is non-linear?

[Figure: data with a clearly non-linear relationship between input and target.]
Linear Basis Function Models
Linear combination of fixed nonlinear basis functions
y(x,w) = ∑_{j=0}^{M−1} wjφj(x) = wTφ(x)

parameter w = (w0, . . . ,wM−1)T
basis functions φ(x) = (φ0(x), . . . , φM−1(x))T
convention φ0(x) = 1; w0 is the bias parameter
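A minimal NumPy sketch of such a model, using a polynomial basis as the running example (the function names and weight values are illustrative, not from the course code):

    import numpy as np

    def polynomial_design_matrix(x, M):
        """Phi[n, j] = phi_j(x_n) = x_n**j for j = 0, ..., M-1; column 0 is phi_0(x) = 1."""
        return np.vander(np.asarray(x, dtype=float), M, increasing=True)

    def predict(x, w):
        """Evaluate y(x, w) = w^T phi(x) for every input in x."""
        return polynomial_design_matrix(x, len(w)) @ w

    w = np.array([0.5, -1.0, 0.3, 0.1])        # illustrative weights, M = 4
    print(predict(np.linspace(-1.0, 1.0, 5), w))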
Polynomial Basis Functions
Scalar input variable x
φj(x) = x^j

Limitation: polynomials are global functions of the input variable x, so the learned function will extrapolate poorly.

[Figure: polynomial basis functions x^j on the interval [−1, 1].]
’Gaussian’ Basis Functions
Scalar input variable x
φj(x) = exp{ −(x − µj)² / (2s²) }

Not a probability distribution.
No normalisation required; it is taken care of by the model parameters w.
Well behaved away from the data (though pulled to zero).

[Figure: Gaussian basis functions on the interval [−1, 1].]
Sigmoidal Basis Functions
Scalar input variable x
φj(x) = σ( (x − µj) / s )

where σ(a) is the logistic sigmoid function defined by

σ(a) = 1 / (1 + exp(−a))

σ(a) is related to the hyperbolic tangent tanh(a) by tanh(a) = 2σ(a) − 1.

[Figure: sigmoidal basis functions on the interval [−1, 1].]
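Both families are easy to evaluate directly; a small sketch with illustrative centres µj and scale s (names and values are assumptions for the example):

    import numpy as np

    def gaussian_basis(x, mu, s):
        """phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2)); one column per centre mu_j."""
        x = np.asarray(x, dtype=float)[:, None]
        return np.exp(-((x - mu) ** 2) / (2.0 * s ** 2))

    def sigmoidal_basis(x, mu, s):
        """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid."""
        x = np.asarray(x, dtype=float)[:, None]
        return 1.0 / (1.0 + np.exp(-(x - mu) / s))

    mu = np.linspace(-1.0, 1.0, 5)                 # illustrative centres
    xs = np.linspace(-1.0, 1.0, 50)
    print(gaussian_basis(xs, mu, s=0.3).shape)     # (50, 5)
    print(sigmoidal_basis(xs, mu, s=0.1).shape)    # (50, 5)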
Other Basis Functions
Fourier Basis: each basis function represents a specific frequency and has infinite spatial extent.
Wavelets: localised in both space and frequency (also mutually orthogonal to simplify application).
Splines: piecewise polynomials restricted to regions of the input space; additional constraints where pieces meet, e.g. smoothness constraints → conditions on the derivatives.
[Figure: approximating the points {(0, 0), (1, 1), (2, −1), (3, 0), (4, −2), (5, 1)} by linear, quadratic, cubic, and quartic splines.]
Maximum Likelihood and Least Squares
No special assumption about the basis functions φj(x). In the simplest case, one can think of φj(x) = x^j, or φ(x) = x.
Assume the target t is given by

t = y(x,w) + ε

where y(x,w) is deterministic and ε is a zero-mean Gaussian random variable with precision (inverse variance) β. Thus

p(t | x,w, β) = N (t | y(x,w), β−1)

[Figure: the conditional density p(t | x0), a Gaussian centred on y(x0).]
Maximum Likelihood and Least Squares
Likelihood of one target t given the data x was
p(t | x,w, β) = N (t | y(x,w), β−1)
Now, consider a set of inputs X with corresponding target values t.
Assume the data are independent and identically distributed (i.i.d.), meaning they are drawn independently and from the same distribution. The likelihood of the target t is then

p(t |X,w, β) = ∏_{n=1}^{N} N (tn | y(xn,w), β−1)
             = ∏_{n=1}^{N} N (tn |wTφ(xn), β−1)

From now on we drop the conditioning variable X from the notation, as with supervised learning we do not seek to model the distribution of the input data.
Maximum Likelihood and Least Squares
Consider the logarithm of the likelihood p(t |w, β) (the logarithm is a monotone function!)

ln p(t |w, β) = ∑_{n=1}^{N} ln N (tn |wTφ(xn), β−1)
             = ∑_{n=1}^{N} ln ( √(β/(2π)) exp{ −(β/2)(tn − wTφ(xn))² } )
             = (N/2) ln β − (N/2) ln(2π) − βED(w)

where the sum-of-squares error function is

ED(w) = (1/2) ∑_{n=1}^{N} {tn − wTφ(xn)}².

argmax_w ln p(t |w, β) → argmin_w ED(w)
Maximum Likelihood and Least Squares
Goal: find a more compact representation.
Rewrite the error function

ED(w) = (1/2) ∑_{n=1}^{N} {tn − wTφ(xn)}² = (1/2)(t−Φw)T(t−Φw)

where t = (t1, . . . , tN)T, and Φ is the N × M design matrix with entries Φnj = φj(xn), i.e. the nth row of Φ is (φ0(xn), φ1(xn), . . . , φM−1(xn)).
Maximum Likelihood and Least Squares
The log likelihood is now
ln p(t |w, β) = (N/2) ln β − (N/2) ln(2π) − βED(w)
             = (N/2) ln β − (N/2) ln(2π) − (β/2)(t−Φw)T(t−Φw)

Find the critical points of ln p(t |w, β). The gradient with respect to w is

∇w ln p(t |w, β) = βΦT(t−Φw).

Setting the gradient to zero gives

0 = ΦT t−ΦTΦw,

which results in

wML = (ΦTΦ)−1ΦT t = Φ†t

where Φ† is the Moore-Penrose pseudo-inverse of the matrix Φ.
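In code, wML is exactly the pseudo-inverse solution; a minimal sketch on made-up data (the sinusoidal toy target and basis size are assumptions for the example):

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 20)                       # toy inputs
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)   # toy targets

    Phi = np.vander(x, 6, increasing=True)              # design matrix, phi_j(x) = x**j, M = 6

    # w_ML = (Phi^T Phi)^{-1} Phi^T t = pinv(Phi) @ t
    w_ml = np.linalg.pinv(Phi) @ t
    w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # numerically preferable equivalent
    print(np.allclose(w_ml, w_lstsq))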
Maximum Likelihood and Least Squares
The log likelihood with the optimal wML is now
ln p(t |wML, β) = (N/2) ln β − (N/2) ln(2π) − (β/2)(t−ΦwML)T(t−ΦwML)

Find the critical points of ln p(t |wML, β) with respect to β,

∂ ln p(t |wML, β) / ∂β = 0

which results in

1/βML = (1/N)(t−ΦwML)T(t−ΦwML)

Note: we can first find the maximum likelihood solution for w, as this does not depend on β. Then we can use wML to find the maximum likelihood solution for β.
Could we have chosen to optimise with respect to β first, and then with respect to w?
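Continuing the sketch above (same made-up data), the maximum likelihood noise precision is the reciprocal of the mean squared residual:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 20)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
    Phi = np.vander(x, 6, increasing=True)

    w_ml = np.linalg.pinv(Phi) @ t
    residual = t - Phi @ w_ml

    # 1 / beta_ML = (1/N) (t - Phi w_ML)^T (t - Phi w_ML)
    beta_ml = 1.0 / np.mean(residual ** 2)
    print(beta_ml)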
Sequential Learning - Stochastic Gradient Descent
For large data sets, calculating the maximum likelihood parameters wML and βML may be costly.
For online applications, not all data can be held in memory.
Use a sequential algorithm (online algorithm).
If the error function is a sum over data points, E = ∑n En, then
1 initialise w(0) to some starting value
2 update the parameter vector at iteration τ + 1 by

w(τ+1) = w(τ) − η∇En,

where En is the error function after presenting the nth data point, and η is the learning rate.
Sequential Learning - Stochastic Gradient Descent
For the sum-of-squares error function, stochastic gradient descent results in

w(τ+1) = w(τ) + η (tn − w(τ)Tφ(xn)) φ(xn)

The value of the learning rate must be chosen carefully. Too large a learning rate may prevent the algorithm from converging; too small a learning rate makes it follow the data too slowly.
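A sketch of this sequential update on made-up data; the basis size, learning rate, and number of passes are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 1.0, size=200)                          # toy inputs
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
    Phi = np.vander(x, 4, increasing=True)                       # row n is phi(x_n)

    w = np.zeros(Phi.shape[1])                                   # w^(0)
    eta = 0.1                                                    # learning rate

    for _ in range(50):                                          # several passes over the data
        for n in rng.permutation(len(t)):                        # present one data point at a time
            phi_n = Phi[n]
            w = w + eta * (t[n] - w @ phi_n) * phi_n             # w^(tau+1) = w^(tau) + eta (t_n - w^T phi_n) phi_n
    print(w)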
Regularized Least Squares
Add regularisation in order to prevent overfitting
ED(w) + λEW(w)
with regularisation coefficient λ.
Simple quadratic regulariser

EW(w) = (1/2) wTw

The solution minimising the regularised error is

w = (λI +ΦTΦ)−1ΦT t
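The regularised solution only changes the matrix being solved; a minimal sketch on made-up data (the value of λ is illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 1.0, 20)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
    Phi = np.vander(x, 10, increasing=True)

    lam = 1e-3                                                   # regularisation coefficient lambda
    M = Phi.shape[1]

    # w = (lambda I + Phi^T Phi)^{-1} Phi^T t
    w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    print(w_reg)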
Regularized Least Squares
More general regulariser
EW(w) = (1/2) ∑_{j=1}^{M} |wj|^q

q = 1 (the lasso) leads to a sparse model if λ is large enough.

[Figure: contours of the regulariser |w|^q for q = 0.5, q = 1, q = 2, q = 4.]
Lagrangian Dual View of the Regulariser
By the Lagrange multiplier method, minimisation of the regularised error function

(1/2) ∑_{n=1}^{N} (tn − wTφ(xn))² + (λ/2) ∑_{j=1}^{M} |wj|^q

is equivalent to minimising the unregularised sum-of-squares error

(1/2) ∑_{n=1}^{N} (tn − wTφ(xn))²   subject to   ∑_{j=1}^{M} |wj|^q ≤ η.
This yields the figures on the next slide.
Comparison of Quadratic and Lasso Regulariser
Quadratic regulariser: (1/2) ∑_{j=1}^{M} wj²

Lasso regulariser: (1/2) ∑_{j=1}^{M} |wj|

[Figure: the constraint regions in (w1, w2) for the quadratic and lasso regularisers, with the constrained optimum w★.]
Multiple Outputs
More than one target variable per data point: y becomes a vector instead of a scalar. Each dimension could be treated with a different set of basis functions (and that may be necessary if the different target dimensions represent very different types of information). Here we restrict ourselves to the SAME basis functions
y(x,w) = WTφ(x)
where y is a K-dimensional column vector, W is an M × K matrix of model parameters, and φ(x) = (φ0(x), . . . , φM−1(x)), with φ0(x) = 1, as before.
Define the target matrix T containing the target vector tnT in its nth row.
Multiple Outputs
Suppose the conditional distribution of the target vector is an isotropic Gaussian of the form
p(t | x,W, β) = N (t |WTφ(x), β−1I).
The log likelihood is then
ln p(T |X,W, β) = ∑_{n=1}^{N} ln N (tn |WTφ(xn), β−1I)
               = (NK/2) ln(β/(2π)) − (β/2) ∑_{n=1}^{N} ‖tn −WTφ(xn)‖²
Multiple Outputs
Maximisation with respect to W results in
WML = (ΦTΦ)−1ΦTT.
For each target variable tk, we get
wk = (ΦTΦ)−1ΦT tk = Φ†tk.
The solution decouples between the different target variables.
This also holds for a general Gaussian noise distribution with arbitrary covariance matrix.
Why? W defines the mean of the Gaussian noise distribution, and the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
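A sketch illustrating the decoupling: the multi-output solution equals the single-output solution applied column by column (the K = 2 toy targets are made up for the example):

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 1.0, 30)
    Phi = np.vander(x, 5, increasing=True)                        # N x M design matrix
    T = np.column_stack([np.sin(2 * np.pi * x),                   # K = 2 toy target columns
                         np.cos(2 * np.pi * x)])
    T = T + rng.normal(scale=0.1, size=T.shape)

    W_ml = np.linalg.pinv(Phi) @ T                                # M x K, W_ML = (Phi^T Phi)^{-1} Phi^T T

    # Column k equals the single-output solution for target variable t_k
    per_column = np.column_stack([np.linalg.pinv(Phi) @ T[:, k] for k in range(T.shape[1])])
    print(np.allclose(W_ml, per_column))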
Loss Function for Regression
Over-fitting results from a large number of basis functions and a relatively small training set.
Regularisation can prevent overfitting, but how do we find the correct value for the regularisation constant λ?
The frequentist viewpoint of model complexity is the bias-variance trade-off.
Loss Function for Regression
Choose an estimator y(x) to estimate the target value t for each input x.
Choose a loss function L(t, y(x)) which measures the difference between the target t and the estimate y(x).
The expected loss is then

E[L] = ∫∫ L(t, y(x)) p(x, t) dx dt

Common choice: the squared loss

L(t, y(x)) = {y(x)− t}².

Expected loss for the squared loss function

E[L] = ∫∫ {y(x)− t}² p(x, t) dx dt.
Loss Function for Regression
Expected loss for the squared loss function

E[L] = ∫∫ {y(x)− t}² p(x, t) dx dt.

Minimise E[L] by choosing the regression function

y(x) = ∫ t p(x, t) dt / p(x) = ∫ t p(t | x) dt = E[t | x]

(calculus of variations is not required to derive this result; we may work point-wise by fixing an x and using stationarity to solve for y(x). Why is that sufficient?)
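A sketch of the point-wise argument (it suffices because the inner integral below is non-negative for every x, so minimising it at each x separately minimises the double integral):

    \frac{\partial}{\partial y(x)} \int \{y(x) - t\}^2\, p(x, t)\, \mathrm{d}t
      = 2 \int \{y(x) - t\}\, p(x, t)\, \mathrm{d}t = 0
    \;\Longrightarrow\;
    y(x)\, p(x) = \int t\, p(x, t)\, \mathrm{d}t
    \;\Longrightarrow\;
    y(x) = \frac{\int t\, p(x, t)\, \mathrm{d}t}{p(x)} = \mathbb{E}[t \mid x].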
Optimal Predictor for Squared Loss
The regression function which minimises the expected squared loss is given by the mean of the conditional distribution p(t | x).

[Figure: the regression function y(x) passes through the mean of the conditional density p(t | x); at x0 the prediction is y(x0).]
Analysing the Squared Loss (1)
Analyse the expected loss

E[L] = ∫∫ {y(x)− t}² p(x, t) dx dt.

Rewrite the squared loss

{y(x)− t}² = {y(x)− E[t | x] + E[t | x]− t}²
           = {y(x)− E[t | x]}² + {E[t | x]− t}² + 2 {y(x)− E[t | x]} {E[t | x]− t}

Claim: ∫∫ {y(x)− E[t | x]} {E[t | x]− t} p(x, t) dx dt = 0.
Analysing the Squared Loss (2)
Claim: ∫∫ {y(x)− E[t | x]} {E[t | x]− t} p(x, t) dx dt = 0.

Separate the factor depending only on x from the integral over t:

∫ {y(x)− E[t | x]} ( ∫ {E[t | x]− t} p(x, t) dt ) dx

Calculate the integral over t:

∫ {E[t | x]− t} p(x, t) dt = E[t | x] p(x) − p(x) ∫ t p(x, t)/p(x) dt
                           = E[t | x] p(x) − p(x) E[t | x]
                           = 0
Analysing the Squared Loss (3)
The expected loss is now
E[L] = ∫ {y(x)− E[t | x]}² p(x) dx + ∫ var[t | x] p(x) dx    (1)
Minimise the first term by choosing y(x) = E[t | x] (as we saw already).
The second term represents the intrinsic variability of the target data and can be regarded as noise. Being independent of the choice of y(x), it cannot be reduced by learning a better y(x).
The Bias-Variance Decomposition (1)
Consider again the squared loss, for which the optimal prediction is given by the conditional expectation h(x)

h(x) = E[t | x] = ∫ t p(t | x) dt.

Since h(x) is unavailable to us, it must be estimated from a (finite) data set D.
D is a finite sample from the unknown joint distribution p(x, t).
Denote the dependency of the learned function on the data by y(x;D).
Evaluate the performance of the algorithm by taking the expectation ED[L] over all data sets D.
The Bias-Variance Decomposition (2)
Taking the expectation over data sets D, using Eqn (1), and interchanging the order of expectations for the first term:

ED[E[L]] = ∫ ED[{y(x;D)− h(x)}²] p(x) dx + ∫∫ {h(x)− t}² p(x, t) dx dt

Again, add and subtract the expectation ED[y(x;D)]

{y(x;D)− h(x)}² = { y(x;D)− ED[y(x;D)] + ED[y(x;D)]− h(x) }²

and show that the mixed term vanishes under the expectation ED.
The Bias-Variance Decomposition (3)
Expected loss ED[L] over all data sets D

expected loss = (bias)² + variance + noise

where

(bias)² = ∫ {ED[y(x;D)]− h(x)}² p(x) dx
variance = ∫ ED[{y(x;D)− ED[y(x;D)]}²] p(x) dx
noise = ∫∫ {h(x)− t}² p(x, t) dx dt.

(bias)² : How accurate is the model across different training sets? (How much does the average prediction over all data sets differ from the desired regression function?)
variance : How sensitive is the model to small changes in the training set? (How much do solutions for individual data sets vary around their average?)
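A small simulation sketch that estimates (bias)² and variance empirically by fitting many data sets drawn around a known h(x); the sinusoidal h, the quadratic regulariser, and all settings are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(4)
    x_grid = np.linspace(0.0, 1.0, 25)                 # evaluation points
    h = np.sin(2 * np.pi * x_grid)                     # "true" regression function h(x)

    def fit_one_dataset(lam, M=10, N=25, noise=0.3):
        """Draw one data set D and return the regularised fit y(x; D) on x_grid."""
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=N)
        Phi = np.vander(x, M, increasing=True)
        w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
        return np.vander(x_grid, M, increasing=True) @ w

    L = 100                                            # number of data sets
    fits = np.array([fit_one_dataset(lam=np.exp(-2.4)) for _ in range(L)])
    avg_fit = fits.mean(axis=0)                        # estimate of E_D[y(x; D)]

    bias2 = np.mean((avg_fit - h) ** 2)                # average of {E_D[y] - h}^2 over x_grid
    variance = np.mean((fits - avg_fit) ** 2)          # average over x_grid and over D
    print(bias2, variance)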
The Bias-Variance Decomposition
Simple models have low variance and high bias.
[Figure: fits with ln λ = 2.6.]
Left: result of fitting the model to 100 data sets, only 25 shown.
Right: average of the 100 fits in red, the sinusoidal function from which the data were generated in green.
The Bias-Variance Decomposition
Complex models have high variance and low bias.
[Figure: fits with ln λ = −2.4.]
Left: result of fitting the model to 100 data sets, only 25 shown.
Right: average of the 100 fits in red, the sinusoidal function from which the data were generated in green.
The Bias-Variance Decomposition
Dependence of bias and variance on the model complexity.
Squared bias, variance, their sum, and the test error.
The minimum of (bias)² + variance occurs close to the value of ln λ that gives the minimum test error.

[Figure: (bias)², variance, (bias)² + variance, and test error plotted against ln λ.]
Unbiased Estimators
You may have encountered unbiased estimators.
Why guarantee zero bias? To quote the pioneer of Bayesian inference, Edwin Jaynes, from his book Probability Theory: The Logic of Science (2003):
The Bias-Variance Decomposition
Tradeoff between bias and variance:
simple models have low variance and high bias
complex models have high variance and low bias
The sum of bias and variance has a minimum at a certain model complexity.
Expected loss ED[L] over all data sets D

expected loss = (bias)² + variance + noise.

The noise comes from the data and cannot be removed from the expected loss.
To analyse the bias-variance decomposition, many data sets are needed, which are not always available.