Statistical Machine Learning: Introduction
Dino Sejdinovic, Department of Statistics, University of Oxford
22-24 June 2015, Novi Sad
Slides available at: http://www.stats.ox.ac.uk/~sejdinov/talks.html
Introduction Introduction
What is Machine Learning?
Arthur Samuel, 1959: Field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell, 1997: Any computer program that improves its performance at some task through experience.

Kevin Murphy, 2012: To develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.
Introduction Introduction
What is Machine Learning?
data → information, structure, prediction, decisions, actions

[Figure from http://gureckislab.org] Larry Page about DeepMind's ML systems that can learn to play video games like humans
Introduction Introduction
What is Machine Learning?
[Diagram: Machine Learning at the intersection of statistics, computer science, cognitive science, psychology, mathematics, engineering, operations research, physics, biology/genetics, and business/finance.]
Introduction Introduction
What is Data Science?
’Data Scientists’ Meld Statistics and Software (WSJ article)
Introduction Introduction
Information Revolution
Traditional statistical inference problems:
- Well formulated question that we would like to answer.
- Expensive data gathering and/or expensive computation.
- Create specially designed experiments to collect high quality data.

Information revolution:
- Improvements in data processing and data storage.
- Powerful, cheap, easy data capturing.
- Lots of (low quality) data with potentially valuable information inside.
Introduction Introduction
Statistics and Machine Learning in the age of Big Data
- ML becoming a thorough blending of computer science, engineering and statistics
- unified framework of data, inferences, procedures, algorithms
- statistics taking computation seriously
- computing taking statistical risk seriously
- scale and granularity of data
- personalization, societal and business impact
- multidisciplinarity - and you are the interdisciplinary glue
- it's just getting started
Michael Jordan: On the Computational and Statistical Interface and "Big Data"
Introduction Applications of Machine Learning
Applications of Machine Learning
- spam filtering
- recommendation systems
- fraud detection
- self-driving cars
- image recognition
- stock market analysis
ImageNet: Krizhevsky et al, 2012; Machine Learning is Eating the World: Forbes article
Introduction Types of Machine Learning
Types of Machine Learning
Supervised learning
- Data contains “labels”: every example is an input-output pair
- classification, regression
- Goal: prediction on new examples
Unsupervised learning
- Extract key features of the “unlabelled” data
- clustering, signal separation, density estimation
- Goal: representation, hypothesis generation, visualization
Introduction Types of Machine Learning
Types of Machine Learning
Semi-supervised Learning
A database of examples, only a small subset of which are labelled.
Multi-task Learning
A database of examples, each of which has multiple labels corresponding todifferent prediction tasks.
Reinforcement Learning
An agent acting in an environment, given rewards for performing appropriate actions, learns to maximize its reward.
Introduction Types of Machine Learning
Literature
Supervised Learning Supervised Learning
Supervised Learning
We observe the input-output pairs $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$.

Types of supervised learning:
- Classification: discrete responses, e.g. $\mathcal{Y} = \{+1, -1\}$ or $\{1, \dots, K\}$.
- Regression: a numerical value is observed and $\mathcal{Y} = \mathbb{R}$.

The goal is to accurately predict the response $Y$ on new observations of $X$, i.e., to learn a function $f: \mathbb{R}^p \to \mathcal{Y}$ such that $f(X)$ will be close to the true response $Y$.
Supervised Learning Decision Theory
Loss function
Suppose we made a prediction $\hat{Y} = f(X) \in \mathcal{Y}$ based on observation of $X$. How good is the prediction? We can use a loss function $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ to formalize the quality of the prediction.

Typical loss functions:
- Misclassification loss (or 0-1 loss) for classification:
$$L(Y, f(X)) = \begin{cases} 0 & f(X) = Y \\ 1 & f(X) \neq Y \end{cases}$$
- Squared loss for regression:
$$L(Y, f(X)) = (f(X) - Y)^2$$

Many other choices are possible, e.g., weighted misclassification loss. In classification, the vector of estimated probabilities $(\hat{p}(k))_{k \in \mathcal{Y}}$ can be returned, and in this case log-likelihood loss (or log loss) $L(Y, \hat{p}) = -\log \hat{p}(Y)$ is often used.
Supervised Learning Decision Theory
Risk
Paired observations $\{(x_i, y_i)\}_{i=1}^n$ are viewed as i.i.d. realizations of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $P_{XY}$.

Risk: for a given loss function $L$, the risk $R$ of a learned function $f$ is given by the expected loss
$$R(f) = \mathbb{E}_{P_{XY}}\left[L(Y, f(X))\right],$$
where the expectation is with respect to the true (unknown) joint distribution of $(X, Y)$.

The risk is unknown, but we can compute the empirical risk:
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
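As a concrete illustration, the losses and the empirical risk are one-liners in R (a minimal sketch; the helper names are ours, not from the slides):

```r
# 0-1 loss and squared loss, and the empirical risk as an average loss.
misclass_loss <- function(y, fx) as.numeric(y != fx)
squared_loss  <- function(y, fx) (fx - y)^2
empirical_risk <- function(y, fx, loss) mean(loss(y, fx))

y  <- c(1, -1, 1, 1)      # observed labels
fx <- c(1,  1, 1, -1)     # predictions f(x_i)
empirical_risk(y, fx, misclass_loss)  # 0.5
```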
Supervised Learning Decision Theory
The Bayes Classifier
What is the optimal classifier if the joint distribution of $(X, Y)$ were known?

The density $g$ of $X$ can be written as a mixture of $K$ components (corresponding to each of the classes):
$$g(x) = \sum_{k=1}^K \pi_k g_k(x),$$
where, for $k = 1, \dots, K$:
- $P(Y = k) = \pi_k$ are the class probabilities,
- $g_k(x)$ is the conditional density of $X$, given $Y = k$.

The Bayes classifier $f_{\text{Bayes}}: \mathcal{X} \to \{1, \dots, K\}$ is the one with minimum risk:
$$R(f) = \mathbb{E}\left[L(Y, f(X))\right] = \mathbb{E}_X\left[\mathbb{E}_{Y|X}\left[L(Y, f(X)) \mid X\right]\right] = \int_{\mathcal{X}} \mathbb{E}\left[L(Y, f(X)) \mid X = x\right] g(x)\, dx.$$
The minimum risk attained by the Bayes classifier is called the Bayes risk. Minimizing $\mathbb{E}[L(Y, f(X)) \mid X = x]$ separately for each $x$ suffices.
Supervised Learning Decision Theory
The Bayes Classifier
Consider the 0-1 loss. The risk simplifies to:
$$\mathbb{E}\left[L(Y, f(X)) \mid X = x\right] = \sum_{k=1}^K L(k, f(x))\, P(Y = k \mid X = x) = 1 - P(Y = f(x) \mid X = x).$$

The risk is minimized by choosing the class with the greatest posterior probability:
$$f_{\text{Bayes}}(x) = \arg\max_{k=1,\dots,K} P(Y = k \mid X = x) = \arg\max_{k=1,\dots,K} \frac{\pi_k g_k(x)}{\sum_{j=1}^K \pi_j g_j(x)} = \arg\max_{k=1,\dots,K} \pi_k g_k(x).$$

The functions $x \mapsto \pi_k g_k(x)$ are called discriminant functions. The discriminant function with maximum value determines the predicted class of $x$.
Supervised Learning Decision Theory
The Bayes Classifier: Example

A simple two-Gaussians example: suppose $X \sim \mathcal{N}(\mu_Y, 1)$, where $\mu_1 = -1$ and $\mu_2 = 1$, and assume equal priors $\pi_1 = \pi_2 = 1/2$:
$$g_1(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x+1)^2}{2}\right) \quad \text{and} \quad g_2(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-1)^2}{2}\right).$$

[Figures: the two conditional densities and the marginal density of $X$ over $x \in [-3, 3]$.]

Optimal classification is
$$f_{\text{Bayes}}(x) = \arg\max_{k=1,\dots,K} \pi_k g_k(x) = \begin{cases} 1 & \text{if } x < 0, \\ 2 & \text{if } x \geq 0. \end{cases}$$
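This rule is easy to verify numerically. A minimal R sketch (our code, not from the slides) evaluates $\pi_k g_k(x)$ and picks the larger:

```r
# Bayes classifier for the two-Gaussian example: equal priors, unit variances.
pi1 <- pi2 <- 1/2
g1 <- function(x) dnorm(x, mean = -1, sd = 1)
g2 <- function(x) dnorm(x, mean =  1, sd = 1)

f_bayes <- function(x) ifelse(pi1 * g1(x) > pi2 * g2(x), 1, 2)

f_bayes(c(-0.5, 0, 0.5))  # 1 2 2: the decision boundary is at x = 0
x <- seq(-3, 3, length.out = 601)
plot(x, g1(x), type = "l", ylab = "conditional densities")
lines(x, g2(x), lty = 2)
```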
Supervised Learning Decision Theory
The Bayes Classifier: Example
How do you classify a new observation x if now the standard deviation is still 1for class 1 but 1/3 for class 2?
[Figures: the two conditional densities over $x \in [-3, 3]$, on linear and log scales.]
Looking at density in a log-scale, optimal classification is to select class 2 ifand only if x ∈ [0.34, 2.16].
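The interval endpoints can be recovered numerically; a short sketch (our code) solves $\pi_1 g_1(x) = \pi_2 g_2(x)$ via the log-density difference:

```r
# Unequal variances: class 1 has sd = 1, class 2 has sd = 1/3, equal priors.
g1 <- function(x) dnorm(x, mean = -1, sd = 1)
g2 <- function(x) dnorm(x, mean =  1, sd = 1/3)
diff_log <- function(x) log(g2(x)) - log(g1(x))  # positive where class 2 wins

lower <- uniroot(diff_log, c(0, 1))$root  # ~0.34
upper <- uniroot(diff_log, c(2, 3))$root  # ~2.16
c(lower, upper)
```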
Supervised Learning Decision Theory
Plug-in Classification
The Bayes classifier chooses the class with the greatest posterior probability:
$$f_{\text{Bayes}}(x) = \arg\max_{k=1,\dots,K} \pi_k g_k(x).$$
We know neither the conditional densities $g_k$ nor the class probabilities $\pi_k$! The plug-in classifier chooses the class
$$\hat{f}(x) = \arg\max_{k=1,\dots,K} \hat{\pi}_k \hat{g}_k(x),$$
where we plugged in estimates $\hat{\pi}_k$ of $\pi_k$, $k = 1, \dots, K$, and estimates $\hat{g}_k(x)$ of the conditional densities $g_k(x)$.
Supervised Learning Linear Discriminant Analysis
Linear Discriminant Analysis
LDA is the simplest example of plug-in classification. Assume a multivariate normal conditional density $g_k(x)$ for each class $k$:
$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma),$$
$$g_k(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)\right),$$
- each class can have a different mean $\mu_k$,
- all classes share the same covariance $\Sigma$.

For an observation $x$, the $k$-th log-discriminant function is
$$\log \pi_k g_k(x) = \text{const} + \log \pi_k - \tfrac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) = \text{const} + a_k + b_k^\top x.$$
Linear log-discriminant functions ↔ linear decision boundaries.

The quantity $(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$ is the squared Mahalanobis distance between $x$ and $\mu_k$. If $\Sigma = I_p$ and $\pi_k = \frac{1}{K}$, LDA simply chooses the class $k$ with the nearest (in the Euclidean sense) class mean.
Supervised Learning Linear Discriminant Analysis
Iris Dataset
[Scatter plot of the iris data: Petal.Length (x-axis) vs Petal.Width (y-axis).]
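Fitting LDA to these two features takes a few lines in R with MASS::lda (a sketch for illustration):

```r
# Plug-in classification with LDA on the iris data (two petal features).
library(MASS)
fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris)
pred <- predict(fit, iris)$class
table(predicted = pred, truth = iris$Species)  # training confusion matrix

plot(iris$Petal.Length, iris$Petal.Width, col = as.numeric(iris$Species),
     xlab = "Petal.Length", ylab = "Petal.Width")
```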
Supervised Learning Statistical Learning Theory
Generative vs Discriminative Learning
Generative learning: find parameters which explain all the data available.
$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$$
- Examples: LDA, naïve Bayes.
- Makes use of all the data available.
- Flexible framework, can incorporate other tasks, incomplete data.
- Stronger modelling assumptions.

Discriminative learning: find parameters that aid in prediction.
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(y_i, f_\theta(x_i)) \quad \text{or} \quad \hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(y_i \mid x_i, \theta)$$
- Examples: logistic regression, support vector machines.
- Typically performs better on a given task.
- Weaker modelling assumptions.
- Can overfit more easily.
Supervised Learning Statistical Learning Theory
Generative Learning
We work with a joint distribution $P_{X,Y}(x, y)$ over data vectors and labels. A learning algorithm constructs $f: \mathcal{X} \to \mathcal{Y}$ which predicts the label of $X$. Given a loss function $L$, the risk $R$ of $f(X)$ is
$$R(f) = \mathbb{E}_{X,Y}\left[L(Y, f(X))\right].$$

For 0-1 loss in classification, the Bayes classifier
$$f_{\text{Bayes}}(x) = \arg\max_{k=1,\dots,K} P(Y = k \mid x) = \arg\max_{k=1,\dots,K} P_{X,Y}(x, k)$$
has the minimum risk (Bayes risk), but is unknown since $P_{X,Y}$ is unknown.

Assume a parametric model for the joint: $P_{X,Y}(x, y) = P_{X,Y}(x, y \mid \theta)$. Fit $\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$ and plug back into the Bayes classifier:
$$\hat{f}(x) = \arg\max_{k=1,\dots,K} P_{X,Y}(x, k \mid \hat{\theta}).$$
Supervised Learning Statistical Learning Theory
Hypothesis space and Empirical Risk Minimization
Find the best function in $\mathcal{H}$ minimizing the risk:
$$f^\star = \arg\min_{f \in \mathcal{H}} \mathbb{E}_{X,Y}\left[L(Y, f(X))\right]$$

Empirical Risk Minimization (ERM): minimize the empirical risk instead, since we typically do not know $P_{X,Y}$:
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))$$

The hypothesis space $\mathcal{H}$ is the space of functions $f$ under consideration. How complex should we allow functions $f$ to be? If the hypothesis space $\mathcal{H}$ is "too large", ERM will overfit. The function
$$\hat{f}(x) = \begin{cases} y_i & \text{if } x = x_i, \\ 0 & \text{otherwise} \end{cases}$$
will have zero empirical risk, but is useless for generalization, since it has simply "memorized" the dataset.
Supervised Learning Statistical Learning Theory
Training and Test Performance
Training error is the empirical risk
$$\frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i)).$$
For 0-1 loss in classification, this is the misclassification error on the training data $\{x_i, y_i\}_{i=1}^n$, which were used in learning $\hat{f}$.

Test error is the empirical risk on new, previously unseen observations $\{x_i, y_i\}_{i=1}^m$,
$$\frac{1}{m} \sum_{i=1}^m L(y_i, \hat{f}(x_i)),$$
which were NOT used in learning $\hat{f}$.

Test error is a much better gauge of how well the learned function generalizes to new data. The test error is in general larger than the training error.
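For example, reusing the iris LDA fit from above, a random train/test split gives both quantities (our sketch):

```r
# Training vs test misclassification error for LDA on iris.
library(MASS)
set.seed(1)
train <- sample(nrow(iris), 100)
fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris[train, ])

train_err <- mean(predict(fit)$class != iris$Species[train])
test_err  <- mean(predict(fit, iris[-train, ])$class != iris$Species[-train])
c(train = train_err, test = test_err)
```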
Supervised Learning Statistical Learning Theory
Hypothesis space for two-class LDA
Assume we have two classes $\{+1, -1\}$. Recall that the discriminant functions in LDA are linear. Assuming that data vectors in class $k$ are modelled as $\mathcal{N}(\mu_k, \Sigma)$, choosing class $+1$ over $-1$ involves:
$$a_{+1} + b_{+1}^\top x > a_{-1} + b_{-1}^\top x \iff a_\star + b_\star^\top x > 0,$$
where $a_\star = a_{+1} - a_{-1}$, $b_\star = b_{+1} - b_{-1}$.

Thus, the hypothesis space of two-class LDA consists of functions $f(x) = \text{sign}(a + b^\top x)$. We obtain coefficients $a$ and $b$, and thus the function $f$, through fitting the parameters of the generative model.

Discriminative learning: restrict $\mathcal{H}$ to a class of functions $f(x) = \text{sign}(a + b^\top x)$ and select $a$ and $b$ which minimize the empirical risk.
Supervised Learning Statistical Learning Theory
Space of linear decision functions
Hypothesis space $\mathcal{H} = \{f : f(x) = \text{sign}(a + b^\top x),\ a \in \mathbb{R},\ b \in \mathbb{R}^p\}$. Find $a, b$ that minimize the empirical risk under 0-1 loss:
$$\arg\min_{a,b} \frac{1}{n} \sum_{i=1}^n L(y_i, f_{a,b}(x_i)) = \arg\min_{a,b} \frac{1}{n} \sum_{i=1}^n \begin{cases} 0 & \text{if } y_i = \text{sign}(a + b^\top x_i) \\ 1 & \text{otherwise} \end{cases} = \arg\min_{a,b} \frac{1}{2n} \sum_{i=1}^n \left[1 - \text{sign}\left(y_i(a + b^\top x_i)\right)\right].$$

Combinatorial problem - not typically possible to solve... Maybe easier with a different loss function? (Logistic regression)
Supervised Learning Logistic Regression
Linearity of log-posterior odds
Another way to express the linear decision boundary of LDA:
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$
Solve explicitly for the conditional class probabilities:
$$p(Y = +1 \mid X = x) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x)$$
$$p(Y = -1 \mid X = x) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x)$$
where $s(\cdot)$ is the logistic function.

[Figure: the logistic function $s(z)$ for $z \in [-8, 8]$, rising from 0 to 1.]
Supervised Learning Logistic Regression
Logistic Regression
Consider maximizing the conditional log-likelihood:
$$\ell(a, b) = \sum_{i=1}^n \log p(Y = y_i \mid X = x_i) = \sum_{i=1}^n -\log\left(1 + \exp(-y_i(a + b^\top x_i))\right)$$
Equivalent to minimizing the empirical risk associated with the log loss:
$$\hat{R}^{\text{log}}_{\text{emp}} = \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-y_i(a + b^\top x_i))\right) = \frac{1}{n} \sum_{i=1}^n -\log \underbrace{s\left(y_i(a + b^\top x_i)\right)}_{p_{a,b}(y_i \mid x_i)}$$

[Figure: the log loss $\log(1 + \exp(-y(a + b^\top x)))$ and the 0-1 loss $\frac{1}{2} + \frac{1}{2}\,\text{sign}(-y(a + b^\top x))$ as functions of the margin.]
Supervised Learning Logistic Regression
Logistic Regression
Not possible to find optimal $a, b$ analytically. For simplicity, absorb $a$ as an entry in $b$ by appending '1' into the $x$ vector.

Objective function:
$$\hat{R}^{\text{log}}_{\text{emp}} = \frac{1}{n} \sum_{i=1}^n -\log s(y_i x_i^\top b)$$

Useful properties of the logistic function:
$$s(-z) = 1 - s(z), \qquad \nabla_z s(z) = s(z)s(-z), \qquad \nabla_z \log s(z) = s(-z), \qquad \nabla^2_z \log s(z) = -s(z)s(-z)$$

Differentiate with respect to $b$:
$$\nabla_b \hat{R}^{\text{log}}_{\text{emp}} = \frac{1}{n} \sum_{i=1}^n -s(-y_i x_i^\top b)\, y_i x_i$$
$$\nabla^2_b \hat{R}^{\text{log}}_{\text{emp}} = \frac{1}{n} \sum_{i=1}^n s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top$$
Supervised Learning Logistic Regression
Logistic Regression
The second derivative is positive-definite: the objective function is convex and there is a single unique global minimum. Many different algorithms can find the optimal $b$, e.g.:

Gradient descent:
$$b^{\text{new}} = b + \varepsilon \frac{1}{n} \sum_{i=1}^n s(-y_i x_i^\top b)\, y_i x_i$$

Stochastic gradient descent:
$$b^{\text{new}} = b + \varepsilon_t \frac{1}{|I(t)|} \sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i$$
where $I(t)$ is a subset of the data at iteration $t$, and $\varepsilon_t \to 0$ slowly ($\sum_t \varepsilon_t = \infty$, $\sum_t \varepsilon_t^2 < \infty$).

Newton-Raphson:
$$b^{\text{new}} = b - \left(\nabla^2_b \hat{R}^{\text{log}}_{\text{emp}}\right)^{-1} \nabla_b \hat{R}^{\text{log}}_{\text{emp}}$$
This is also called iteratively reweighted least squares.

Conjugate gradient, L-BFGS and other methods from numerical analysis.
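A minimal batch gradient descent for logistic regression, directly implementing the update above (our sketch; labels in {-1, +1}, fixed step size):

```r
sigm <- function(z) 1 / (1 + exp(-z))

logistic_gd <- function(X, y, eps = 0.5, iters = 2000) {
  X <- cbind(1, X)                # absorb the intercept a into b
  b <- rep(0, ncol(X))
  for (t in seq_len(iters)) {
    margins <- y * drop(X %*% b)
    # b_new = b + eps * (1/n) * sum_i s(-y_i x_i' b) y_i x_i
    b <- b + eps * colMeans(sigm(-margins) * y * X)
  }
  b
}

# Toy data: two Gaussian clouds.
set.seed(1)
X <- rbind(matrix(rnorm(100, -1), 50), matrix(rnorm(100, 1), 50))
y <- rep(c(-1, 1), each = 50)
b_hat <- logistic_gd(X, y)
mean(sign(drop(cbind(1, X) %*% b_hat)) == y)  # training accuracy
```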
Supervised Learning Logistic Regression
Logistic Regression vs. LDA
Both have linear decision boundaries and model the log-posterior odds as
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$
LDA models the marginal density of $x$ as a Gaussian mixture with shared covariance
$$g(x) = \pi_{-1} \mathcal{N}(x; \mu_{-1}, \Sigma) + \pi_{+1} \mathcal{N}(x; \mu_{+1}, \Sigma)$$
and fits the parameters $\theta = (\mu_{-1}, \mu_{+1}, \pi_{-1}, \pi_{+1}, \Sigma)$ by maximizing the joint likelihood $\sum_{i=1}^n p(x_i, y_i \mid \theta)$. $a$ and $b$ are then determined from $\hat{\theta}$.

Logistic regression leaves the marginal density $g(x)$ as an arbitrary density function, and fits the parameters $a, b$ by maximizing the conditional likelihood $\sum_{i=1}^n p(y_i \mid x_i; a, b)$.
Supervised Learning Logistic Regression
Logistic Regression
Properties of logistic regression:
- Makes fewer modelling assumptions than generative classifiers.
- A simple example of a generalised linear model (GLM), for which much statistical theory exists:
  - assessment of fit via deviance and plots,
  - interpretation of entries of $b$ as odds-ratios,
  - fitting categorical data (sometimes called multinomial logistic regression),
  - well-founded approaches to removing insignificant features (drop-in-deviance test, Wald test).
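In R, logistic regression is fitted with glm and the binomial family; a quick sketch on simulated data (all values illustrative):

```r
# Logistic regression as a GLM: deviance, Wald tests and odds-ratios.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(1 + 2 * x))  # true intercept 1, slope 2
fit <- glm(y ~ x, family = binomial)
summary(fit)    # coefficients with Wald z-tests, null and residual deviance
exp(coef(fit))  # entries of b interpreted as odds-ratios
```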
Supervised Learning Model Complexity and Generalization
Model Complexity and Generalization
Supervised Learning Model Complexity and Generalization
Generalization
Generalization ability: what is the out-of-sample error of the learner $\hat{f}$?
- training error $\neq$ testing error
- We learn $\hat{f}$ by minimizing some variant of the empirical risk $\hat{R}_{\text{emp}}(\hat{f})$ - what can we say about the true risk $R(\hat{f})$?
- Two important factors determining generalization ability:
  - Model complexity
  - Training data size
Supervised Learning Model Complexity and Generalization
Learning Curves
[Figure: prediction error vs model complexity/flexibility, showing the training error and test error curves, with regimes labelled Underfit, Just right and Overfit.]
Fixed dataset size, varying model complexity.
Supervised Learning Model Complexity and Generalization
Learning Curves
[Figures: two panels of prediction error vs training dataset size, each showing training error and testing error curves and an overfit region.]
Fixed model complexity, varying dataset size. Two models: one simple, one complex. Which is which?
Supervised Learning Model Complexity and Generalization
Bias-Variance Tradeoff
Where does the prediction error come from?

Example: squared loss in regression: $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \mathbb{R}$,
$$L(Y, f(X)) = (Y - f(X))^2.$$
The optimal $f$ is the conditional mean:
$$f_*(x) = \mathbb{E}\left[Y \mid X = x\right].$$
This follows from $R(f) = \mathbb{E}_X \mathbb{E}\left[(Y - f(X))^2 \mid X\right]$ and
$$\mathbb{E}\left[(Y - f(X))^2 \mid X = x\right] = \mathbb{E}\left[Y^2 \mid X = x\right] - 2f(x)\,\mathbb{E}\left[Y \mid X = x\right] + f(x)^2 = \text{Var}\left[Y \mid X = x\right] + \left(\mathbb{E}\left[Y \mid X = x\right] - f(x)\right)^2.$$
Supervised Learning Model Complexity and Generalization
Bias-Variance Tradeoff
The optimal risk is the intrinsic conditional variability of $Y$ (noise):
$$R(f_*) = \mathbb{E}_X\left[\text{Var}\left[Y \mid X\right]\right].$$
Excess risk:
$$R(f) - R(f_*) = \mathbb{E}_X\left[(f(X) - f_*(X))^2\right].$$
How does the excess risk behave on average? Consider a randomly selected dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^n$ and $f^{(\mathcal{D})}$ trained on $\mathcal{D}$ using a model (hypothesis class) $\mathcal{H}$:
$$\mathbb{E}_{\mathcal{D}}\left[R(f^{(\mathcal{D})}) - R(f_*)\right] = \mathbb{E}_{\mathcal{D}} \mathbb{E}_X\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right] = \mathbb{E}_X \mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right].$$
Supervised Learning Model Complexity and Generalization
Bias-Variance Tradeoff
Denote $\bar{f}(x) = \mathbb{E}_{\mathcal{D}} f^{(\mathcal{D})}(x)$ (the average decision function over all possible datasets). Then
$$\mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right] = \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - \bar{f}(X)\right)^2\right]}_{\text{Var}_X(\mathcal{H}, n)} + \underbrace{\left(\bar{f}(X) - f_*(X)\right)^2}_{\text{Bias}^2_X(\mathcal{H}, n)}.$$
Now,
$$\mathbb{E}_{\mathcal{D}} R(f^{(\mathcal{D})}) = R(f_*) + \mathbb{E}_X \text{Var}_X(\mathcal{H}, n) + \mathbb{E}_X \text{Bias}^2_X(\mathcal{H}, n).$$

Where does the prediction error come from?
- Noise: intrinsic difficulty of the regression problem.
- Bias: how far away is the best learner in the model (the average learner over all possible datasets) from the optimal one?
- Variance: how variable is our learning method if given different datasets?
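These three terms can be seen in a small simulation (our sketch): fit polynomials of increasing degree to many datasets drawn from the same distribution and inspect the fits at a fixed point $x_0$:

```r
# Bias-variance at a single point x0 for polynomial regression.
set.seed(1)
f_star <- function(x) sin(2 * pi * x)   # true regression function
x0 <- 0.3
fit_at_x0 <- function(degree, n_datasets = 200, n = 30) {
  replicate(n_datasets, {
    x <- runif(n); y <- f_star(x) + rnorm(n, sd = 0.3)
    predict(lm(y ~ poly(x, degree)), newdata = data.frame(x = x0))
  })
}
for (d in c(1, 3, 9)) {
  p <- fit_at_x0(d)
  cat(sprintf("degree %d: bias^2 = %.4f, variance = %.4f\n",
              d, (mean(p) - f_star(x0))^2, var(p)))
}
```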
Supervised Learning Model Complexity and Generalization
Learning Curves
[Figures: the same prediction error vs training dataset size curves, now annotated with the bias and variance contributions and the overfit region for each of the two models.]
Supervised Learning Model Complexity and Generalization
Learning Curve
[Figure: prediction error vs model complexity/flexibility with training and test error curves; underfit regime: high bias, low variance; overfit regime: low bias, high variance; "just right" in between.]
Supervised Learning Artificial Neural Networks
Artificial Neural Networks
Supervised Learning Artificial Neural Networks
Biological inspiration
Basic computational elements:neurons.Receives signals from otherneurons via dendrites.Sends processed signals viaaxons.Axon-dendrite interactions atsynapses.1010 − 1011 neurons.1014 − 1015 synapses.
Supervised Learning Artificial Neural Networks
Single Neuron Classifier
[Diagram: a single neuron with inputs $x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(p)}$, weights $w_1, \dots, w_p$, bias $b$, a summation node and transfer function $s(\cdot)$, producing $P(Y = 1 \mid X = x)$ from the activation $w^\top x + b$.]

- activation $w^\top x + b$ (linear in inputs $x$)
- activation/transfer function $s$ gives the output/activity (potentially nonlinear in $x$)
- common nonlinear activation function $s(a) = \frac{1}{1 + e^{-a}}$: logistic regression
- learn $w$ and $b$ via gradient descent
Supervised Learning Artificial Neural Networks
Single Neuron Classifier
[Figure: single neuron classifier fit on two-dimensional data ($x_{i1}$, $x_{i2}$).]
Supervised Learning Artificial Neural Networks
Overfitting
[Figures: network fits shown at gradient descent iterations 30, 80, 500, 3000, 10000 and 40000, progressively overfitting the training data.]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

Prevent overfitting by:
- early stopping: just halt the gradient descent
- regularization: L2-regularization, called weight decay in the neural networks literature.
Supervised Learning Artificial Neural Networks
Multilayer Networks
Data vectors $x_i \in \mathbb{R}^p$, binary labels $y_i \in \{0, 1\}$.
- inputs $x_{i1}, \dots, x_{ip}$
- output $\hat{y}_i = P(Y = 1 \mid X = x_i)$
- hidden unit activities $h_{i1}, \dots, h_{im}$

Compute hidden unit activities:
$$h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right)$$
Compute output probability:
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right)$$

[Diagram: network with inputs $x_{i1}, \dots, x_{i4}$, hidden units $h_{i1}, \dots, h_{i7}$, and output $\hat{y}_i$.]
Supervised Learning Artificial Neural Networks
Multilayer Networks
[Figure: multilayer network fit on two-dimensional data ($x_{i1}$, $x_{i2}$).]
Supervised Learning Artificial Neural Networks
Training a Neural Network
Objective function: L2-regularized log-loss
$$J = -\sum_{i=1}^n \left[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right] + \frac{\lambda}{2}\left(\sum_{jl} (w^h_{jl})^2 + \sum_l (w^o_l)^2\right)$$
where
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right), \qquad h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right).$$
Optimize parameters $\theta = \{b^h, w^h, b^o, w^o\}$, where $b^h \in \mathbb{R}^m$, $w^h \in \mathbb{R}^{p \times m}$, $b^o \in \mathbb{R}$, $w^o \in \mathbb{R}^m$, with gradient descent:
$$\frac{\partial J}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n (\hat{y}_i - y_i) h_{il},$$
$$\frac{\partial J}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial h_{il}} \frac{\partial h_{il}}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n (\hat{y}_i - y_i)\, w^o_l\, h_{il}(1 - h_{il})\, x_{ij}.$$
L2-regularization is often called weight decay. Multiple hidden layers: backpropagation algorithm.
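These gradients translate directly into code. A from-scratch sketch in R (our notation; biases omitted for brevity, gradient step scaled by 1/n):

```r
sigm <- function(z) 1 / (1 + exp(-z))

# Single-hidden-layer network trained by gradient descent on the
# L2-regularized log-loss; X is n x p, y in {0, 1}, m hidden units.
train_nn <- function(X, y, m = 5, lambda = 0.01, eps = 0.5, iters = 2000) {
  p <- ncol(X); n <- nrow(X)
  Wh <- matrix(rnorm(p * m, sd = 0.1), p, m)  # input-to-hidden weights w^h
  wo <- rnorm(m, sd = 0.1)                    # hidden-to-output weights w^o
  for (t in seq_len(iters)) {
    H     <- sigm(X %*% Wh)          # hidden activities h_il
    yhat  <- sigm(drop(H %*% wo))    # output probabilities
    delta <- yhat - y                # (yhat_i - y_i) from the gradients above
    grad_wo <- lambda * wo + drop(t(H) %*% delta)
    grad_Wh <- lambda * Wh +
      t(X) %*% (delta * sweep(H * (1 - H), 2, wo, "*"))
    wo <- wo - eps * grad_wo / n
    Wh <- Wh - eps * grad_Wh / n
  }
  list(Wh = Wh, wo = wo)
}
```

Usage would be train_nn(X, (y + 1) / 2) on the toy two-cloud data above, after recoding the labels to {0, 1}.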
Supervised Learning Artificial Neural Networks
Multiple hidden layers
$$\hat{y}_i = h^{L+1}_i, \qquad h^{\ell+1}_i = s\left(W^{\ell+1} h^\ell_i\right), \qquad x_{i1} = h^0_{i1}, \dots, x_{ip} = h^0_{ip}$$
- $W^{\ell+1} = (w^{\ell+1}_{jk})_{jk}$: weight matrix at the $(\ell+1)$-th layer, with weight $w^\ell_{jk}$ on the edge between $h^{\ell-1}_{ik}$ and $h^\ell_{ij}$
- $s$: entrywise (logistic) transfer function

$$\hat{y}_i = s\left(W^{L+1} s\left(W^L \left(\cdots s\left(W^1 x_i\right)\right)\right)\right)$$

[Diagram: deep network with input layer $h^0_i = x_i$, hidden layers $h^1_i, \dots, h^L_i$ of $m$ units each, and output $\hat{y}_i = h^{L+1}_i$.]
Supervised Learning Artificial Neural Networks
Backpropagation
[Diagram: unit $h^\ell_{ij}$ receives input from $h^{\ell-1}_{ik}$ through weight $w^\ell_{jk}$ and feeds into units $h^{\ell+1}_{i1}, \dots, h^{\ell+1}_{im}$.]

$$J = -\sum_{i=1}^n \left[y_i \log h^{L+1}_i + (1 - y_i) \log\left(1 - h^{L+1}_i\right)\right]$$

Gradients with respect to $h^\ell_{ij}$ are computed by recursive application of the chain rule, and propagated through the network backwards:
$$\frac{\partial J}{\partial h^{L+1}_i} = -\frac{y_i}{h^{L+1}_i} + \frac{1 - y_i}{1 - h^{L+1}_i},$$
$$\frac{\partial J}{\partial h^\ell_{ij}} = \sum_{r=1}^m \frac{\partial J}{\partial h^{\ell+1}_{ir}} \frac{\partial h^{\ell+1}_{ir}}{\partial h^\ell_{ij}}, \qquad \frac{\partial J}{\partial w^\ell_{jk}} = \sum_{i=1}^n \frac{\partial J}{\partial h^\ell_{ij}} \frac{\partial h^\ell_{ij}}{\partial w^\ell_{jk}}.$$
Supervised Learning Artificial Neural Networks
Neural Networks
[Figures: a two-dimensional classification example on the unit square (axes $x_1$, $x_2$): the objective's global solution (global minimum) and local minima 1-3, and a neural network fit with a weight decay of 0.01.]
R package implementing neural networks with a single hidden layer: nnet.
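For instance (a sketch on simulated data; the settings mirror the figure but are otherwise arbitrary):

```r
# nnet: single hidden layer with 7 units and weight decay 0.01.
library(nnet)
set.seed(1)
x1 <- runif(200); x2 <- runif(200)
y  <- factor(as.numeric(x1^2 + x2^2 > 0.8))   # nonlinear class boundary
fit <- nnet(y ~ x1 + x2, size = 7, decay = 0.01, maxit = 500)
pred <- predict(fit, data.frame(x1, x2), type = "class")
mean(pred == y)  # training accuracy
```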
Supervised Learning Artificial Neural Networks
Neural Networks – Discussion
- Nonlinear hidden units introduce modelling flexibility.
- In contrast to user-introduced nonlinearities, features are global, and can be learned to maximize predictive performance.
- Neural networks with a single hidden layer and sufficiently many hidden units can model arbitrarily complex functions.
- The optimization problem is not convex, and the objective function can have many local optima, plateaus and ridges.
- On large scale problems, one often uses stochastic gradient descent, along with a whole host of techniques for optimization, regularization, and initialization.
- Recent developments, especially by Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng and others. See also http://deeplearning.net/.
Supervised Learning Artificial Neural Networks
Dropout Training of Neural Networks
Neural network with a single layer of hidden units:
- Hidden unit activations: $h_{ik} = s\left(b^h_k + \sum_{j=1}^p W^h_{jk} x_{ij}\right)$
- Output probability: $\hat{y}_i = s\left(b^o + \sum_{k=1}^m W^o_k h_{ik}\right)$

Large, overfitted networks often have co-adapted hidden units. What each hidden unit learns may in fact be useless, e.g. predicting the negation of predictions from other units. Can prevent co-adaptation by randomly dropping out units from the network.

[Diagram: network with inputs $x_{i1}, \dots, x_{i4}$, hidden units $h_{i1}, \dots, h_{i7}$, and output $\hat{y}_i$.]

Hinton et al (2012).
Supervised Learning Artificial Neural Networks
Dropout Training of Neural Networks
Model as an ensemble of networks:
$$p(y_i = 1 \mid x_i, \theta) = \sum_{b \subset \{1,\dots,m\}} q^{|b|} (1 - q)^{m - |b|}\; p(y_i = 1 \mid x_i, \theta, \text{drop out units } b)$$

[Diagrams: four sampled subnetworks, each retaining a different subset of the hidden units $h_{i1}, \dots, h_{i7}$.]

- Weight-sharing among all networks: each network uses a subset of the parameters of the full network (corresponding to the retained units).
- Training by stochastic gradient descent: at each iteration a network is sampled from the ensemble, and its subset of parameters is updated.
- Biological inspiration: $10^{14}$ weights to be fitted in a lifetime of $10^9$ seconds; Poisson spikes as a regularization mechanism which prevents co-adaptation: Geoff Hinton on Brains, Sex and Machine Learning.
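As a toy illustration of the training-time sampling (our sketch; here $q$ is the per-unit dropout probability, as in the ensemble sum above):

```r
# One stochastic forward pass: drop each hidden unit with probability q.
sigm <- function(z) 1 / (1 + exp(-z))
dropout_forward <- function(x, Wh, wo, q = 0.5) {
  h    <- sigm(drop(x %*% Wh))           # hidden activities (Wh is p x m)
  keep <- rbinom(length(h), 1, 1 - q)    # sampled subnetwork b
  sigm(sum(wo * h * keep))               # output of the sampled network
}
# At test time, one forward pass of the full network with hidden
# activities scaled by (1 - q) approximates the ensemble average.
```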
Supervised Learning Artificial Neural Networks
Dropout Training of Neural Networks

Classification of phonemes in speech.
Figure from Hinton et al.