Statistical Machine Learning: Introduction
Dino Sejdinovic, Department of Statistics, University of Oxford
22-24 June 2015, Novi Sad
Slides available at: http://www.stats.ox.ac.uk/~sejdinov/talks.html
Introduction Introduction
What is Machine Learning?
Arthur Samuel, 1959: Field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell, 1997: Any computer program that improves its performance at some task through experience.

Kevin Murphy, 2012: To develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.
Introduction Introduction
What is Machine Learning?
data → information, structure, prediction, decisions, actions

[Figure from http://gureckislab.org] Larry Page about DeepMind's ML systems that can learn to play video games like humans
Introduction Introduction
What is Machine Learning?
[Diagram: Machine Learning at the intersection of statistics, computer science, cognitive science, psychology, mathematics, engineering, operations research, physics, biology/genetics, and business/finance.]
Introduction Introduction
What is Data Science?
’Data Scientists’ Meld Statistics and Software (WSJ article)
Introduction Introduction
Information Revolution
Traditional statistical inference problems:
- Well formulated question that we would like to answer.
- Expensive data gathering and/or expensive computation.
- Create specially designed experiments to collect high quality data.

Information revolution:
- Improvements in data processing and data storage.
- Powerful, cheap, easy data capturing.
- Lots of (low quality) data with potentially valuable information inside.
Introduction Introduction
Statistics and Machine Learning in the age of Big Data
- ML becoming a thorough blending of computer science, engineering and statistics
- unified framework of data, inferences, procedures, algorithms
- statistics taking computation seriously
- computing taking statistical risk seriously
- scale and granularity of data
- personalization, societal and business impact
- multidisciplinarity - and you are the interdisciplinary glue
- it's just getting started
Michael Jordan: On the Computational and Statistical Interface and "Big Data"
Introduction Applications of Machine Learning
Applications of Machine Learning
- spam filtering
- recommendation systems
- fraud detection
- self-driving cars
- image recognition
- stock market analysis
ImageNet: Krizhevsky et al, 2012; Machine Learning is Eating the World: Forbes article
Introduction Types of Machine Learning
Types of Machine Learning
Supervised learning
- Data contains “labels”: every example is an input-output pair
- classification, regression
- Goal: prediction on new examples
Unsupervised learning
- Extract key features of the “unlabelled” data
- clustering, signal separation, density estimation
- Goal: representation, hypothesis generation, visualization
Introduction Types of Machine Learning
Types of Machine Learning
Semi-supervised Learning
A database of examples, only a small subset of which are labelled.
Multi-task Learning
A database of examples, each of which has multiple labels corresponding todifferent prediction tasks.
Reinforcement Learning
An agent acting in an environment, given rewards for performing appropriate actions, learns to maximize its reward.
Introduction Types of Machine Learning
Literature
Supervised Learning Supervised Learning
Supervised Learning
We observe the input-output pairs $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathcal{X}$, $y_i \in \mathcal{Y}$.

Types of supervised learning:
- Classification: discrete responses, e.g. $\mathcal{Y} = \{+1, -1\}$ or $\{1, \dots, K\}$.
- Regression: a numerical value is observed and $\mathcal{Y} = \mathbb{R}$.

The goal is to accurately predict the response $Y$ on new observations of $X$, i.e., to learn a function $f: \mathbb{R}^p \to \mathcal{Y}$ such that $f(X)$ will be close to the true response $Y$.
Supervised Learning Decision Theory
Loss function
Suppose we made a prediction $\hat{Y} = f(X) \in \mathcal{Y}$ based on observation of $X$. How good is the prediction? We can use a loss function $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ to formalize the quality of the prediction.

Typical loss functions:
- Misclassification loss (or 0-1 loss) for classification:
$$L(Y, f(X)) = \begin{cases} 0 & f(X) = Y \\ 1 & f(X) \neq Y \end{cases}$$
- Squared loss for regression:
$$L(Y, f(X)) = (f(X) - Y)^2$$

Many other choices are possible, e.g., weighted misclassification loss. In classification, the vector of estimated probabilities $(\hat{p}(k))_{k \in \mathcal{Y}}$ can be returned, and in this case log-likelihood loss (or log loss) $L(Y, \hat{p}) = -\log \hat{p}(Y)$ is often used.
Supervised Learning Decision Theory
Risk
Paired observations $\{(x_i, y_i)\}_{i=1}^n$ are viewed as i.i.d. realizations of a random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $P_{XY}$.

Risk: for a given loss function $L$, the risk $R$ of a learned function $f$ is given by the expected loss
$$R(f) = \mathbb{E}_{P_{XY}}\left[L(Y, f(X))\right],$$
where the expectation is with respect to the true (unknown) joint distribution of $(X, Y)$.

The risk is unknown, but we can compute the empirical risk:
$$R_n(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
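As a concrete illustration, the losses and the empirical risk are one-liners in R (a minimal sketch; the helper names are ours, not from the slides):

```r
# 0-1 loss and squared loss, and the empirical risk as an average loss.
misclass_loss <- function(y, fx) as.numeric(y != fx)
squared_loss  <- function(y, fx) (fx - y)^2
empirical_risk <- function(y, fx, loss) mean(loss(y, fx))

y  <- c(1, -1, 1, 1)      # observed labels
fx <- c(1,  1, 1, -1)     # predictions f(x_i)
empirical_risk(y, fx, misclass_loss)  # 0.5
```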
Supervised Learning Decision Theory
The Bayes Classifier
What is the optimal classifier if the joint distribution of $(X, Y)$ were known?

The density $g$ of $X$ can be written as a mixture of $K$ components (corresponding to each of the classes):
$$g(x) = \sum_{k=1}^K \pi_k g_k(x),$$
where, for $k = 1, \dots, K$:
- $P(Y = k) = \pi_k$ are the class probabilities,
- $g_k(x)$ is the conditional density of $X$, given $Y = k$.

The Bayes classifier $f_{\text{Bayes}}: \mathcal{X} \to \{1, \dots, K\}$ is the one with minimum risk:
$$R(f) = \mathbb{E}\left[L(Y, f(X))\right] = \mathbb{E}_X\left[\mathbb{E}_{Y|X}\left[L(Y, f(X)) \mid X\right]\right] = \int_{\mathcal{X}} \mathbb{E}\left[L(Y, f(X)) \mid X = x\right] g(x)\, dx.$$
The minimum risk attained by the Bayes classifier is called the Bayes risk. Minimizing $\mathbb{E}[L(Y, f(X)) \mid X = x]$ separately for each $x$ suffices.
Supervised Learning Decision Theory
The Bayes Classifier
Consider the 0-1 loss. The risk simplifies to:
$$\mathbb{E}\left[L(Y, f(X)) \mid X = x\right] = \sum_{k=1}^K L(k, f(x))\, P(Y = k \mid X = x) = 1 - P(Y = f(x) \mid X = x).$$

The risk is minimized by choosing the class with the greatest posterior probability:
$$f_{\text{Bayes}}(x) = \arg\max_{k=1,\dots,K} P(Y = k \mid X = x) = \arg\max_{k=1,\dots,K} \frac{\pi_k g_k(x)}{\sum_{j=1}^K \pi_j g_j(x)} = \arg\max_{k=1,\dots,K} \pi_k g_k(x).$$

The functions $x \mapsto \pi_k g_k(x)$ are called discriminant functions. The discriminant function with maximum value determines the predicted class of $x$.
Supervised Learning Decision Theory
The Bayes Classifier: Example

A simple two-Gaussians example: suppose $X \sim \mathcal{N}(\mu_Y, 1)$, where $\mu_1 = -1$ and $\mu_2 = 1$, and assume equal priors $\pi_1 = \pi_2 = 1/2$:
$$g_1(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x+1)^2}{2}\right) \quad \text{and} \quad g_2(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x-1)^2}{2}\right).$$

[Figures: the two conditional densities and the marginal density of $X$ over $x \in [-3, 3]$.]

Optimal classification is
$$f_{\text{Bayes}}(x) = \arg\max_{k=1,\dots,K} \pi_k g_k(x) = \begin{cases} 1 & \text{if } x < 0, \\ 2 & \text{if } x \geq 0. \end{cases}$$
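This rule is easy to verify numerically. A minimal R sketch (our code, not from the slides) evaluates $\pi_k g_k(x)$ and picks the larger:

```r
# Bayes classifier for the two-Gaussian example: equal priors, unit variances.
pi1 <- pi2 <- 1/2
g1 <- function(x) dnorm(x, mean = -1, sd = 1)
g2 <- function(x) dnorm(x, mean =  1, sd = 1)

f_bayes <- function(x) ifelse(pi1 * g1(x) > pi2 * g2(x), 1, 2)

f_bayes(c(-0.5, 0, 0.5))  # 1 2 2: the decision boundary is at x = 0
x <- seq(-3, 3, length.out = 601)
plot(x, g1(x), type = "l", ylab = "conditional densities")
lines(x, g2(x), lty = 2)
```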
Supervised Learning Decision Theory
The Bayes Classifier: Example
How do you classify a new observation x if now the standard deviation is still 1for class 1 but 1/3 for class 2?
[Figures: the two conditional densities over $x \in [-3, 3]$, on linear and log scales.]
Looking at density in a log-scale, optimal classification is to select class 2 ifand only if x ∈ [0.34, 2.16].
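The interval endpoints can be recovered numerically; a short sketch (our code) solves $\pi_1 g_1(x) = \pi_2 g_2(x)$ via the log-density difference:

```r
# Unequal variances: class 1 has sd = 1, class 2 has sd = 1/3, equal priors.
g1 <- function(x) dnorm(x, mean = -1, sd = 1)
g2 <- function(x) dnorm(x, mean =  1, sd = 1/3)
diff_log <- function(x) log(g2(x)) - log(g1(x))  # positive where class 2 wins

lower <- uniroot(diff_log, c(0, 1))$root  # ~0.34
upper <- uniroot(diff_log, c(2, 3))$root  # ~2.16
c(lower, upper)
```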
Supervised Learning Decision Theory
Plug-in Classification
The Bayes classifier chooses the class with the greatest posterior probability:
$$f_{\text{Bayes}}(x) = \arg\max_{k=1,\dots,K} \pi_k g_k(x).$$
We know neither the conditional densities $g_k$ nor the class probabilities $\pi_k$! The plug-in classifier chooses the class
$$\hat{f}(x) = \arg\max_{k=1,\dots,K} \hat{\pi}_k \hat{g}_k(x),$$
where we plugged in estimates $\hat{\pi}_k$ of $\pi_k$, $k = 1, \dots, K$, and estimates $\hat{g}_k(x)$ of the conditional densities $g_k(x)$.
Supervised Learning Linear Discriminant Analysis
Linear Discriminant Analysis
LDA is the simplest example of plug-in classification. Assume a multivariate normal conditional density $g_k(x)$ for each class $k$:
$$X \mid Y = k \sim \mathcal{N}(\mu_k, \Sigma),$$
$$g_k(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)\right),$$
- each class can have a different mean $\mu_k$,
- all classes share the same covariance $\Sigma$.

For an observation $x$, the $k$-th log-discriminant function is
$$\log \pi_k g_k(x) = \text{const} + \log \pi_k - \tfrac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) = \text{const} + a_k + b_k^\top x.$$
Linear log-discriminant functions ↔ linear decision boundaries.

The quantity $(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)$ is the squared Mahalanobis distance between $x$ and $\mu_k$. If $\Sigma = I_p$ and $\pi_k = \frac{1}{K}$, LDA simply chooses the class $k$ with the nearest (in the Euclidean sense) class mean.
Supervised Learning Linear Discriminant Analysis
Iris Dataset
[Scatter plot of the iris data: Petal.Length (x-axis) vs Petal.Width (y-axis).]
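Fitting LDA to these two features takes a few lines in R with MASS::lda (a sketch for illustration):

```r
# Plug-in classification with LDA on the iris data (two petal features).
library(MASS)
fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris)
pred <- predict(fit, iris)$class
table(predicted = pred, truth = iris$Species)  # training confusion matrix

plot(iris$Petal.Length, iris$Petal.Width, col = as.numeric(iris$Species),
     xlab = "Petal.Length", ylab = "Petal.Width")
```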
Supervised Learning Statistical Learning Theory
Generative vs Discriminative Learning
Generative learning: find parameters which explain all the data available.
$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$$
- Examples: LDA, naïve Bayes.
- Makes use of all the data available.
- Flexible framework, can incorporate other tasks, incomplete data.
- Stronger modelling assumptions.

Discriminative learning: find parameters that aid in prediction.
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^n L(y_i, f_\theta(x_i)) \quad \text{or} \quad \hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(y_i \mid x_i, \theta)$$
- Examples: logistic regression, support vector machines.
- Typically performs better on a given task.
- Weaker modelling assumptions.
- Can overfit more easily.
Supervised Learning Statistical Learning Theory
Generative Learning
We work with a joint distribution $P_{X,Y}(x, y)$ over data vectors and labels. A learning algorithm constructs $f: \mathcal{X} \to \mathcal{Y}$ which predicts the label of $X$. Given a loss function $L$, the risk $R$ of $f(X)$ is
$$R(f) = \mathbb{E}_{X,Y}\left[L(Y, f(X))\right].$$

For 0-1 loss in classification, the Bayes classifier
$$f_{\text{Bayes}}(x) = \arg\max_{k=1,\dots,K} P(Y = k \mid x) = \arg\max_{k=1,\dots,K} P_{X,Y}(x, k)$$
has the minimum risk (Bayes risk), but is unknown since $P_{X,Y}$ is unknown.

Assume a parametric model for the joint: $P_{X,Y}(x, y) = P_{X,Y}(x, y \mid \theta)$. Fit $\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log P(x_i, y_i \mid \theta)$ and plug back into the Bayes classifier:
$$\hat{f}(x) = \arg\max_{k=1,\dots,K} P_{X,Y}(x, k \mid \hat{\theta}).$$
Supervised Learning Statistical Learning Theory
Hypothesis space and Empirical Risk Minimization
Find the best function in $\mathcal{H}$ minimizing the risk:
$$f^\star = \arg\min_{f \in \mathcal{H}} \mathbb{E}_{X,Y}\left[L(Y, f(X))\right]$$

Empirical Risk Minimization (ERM): minimize the empirical risk instead, since we typically do not know $P_{X,Y}$:
$$\hat{f} = \arg\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i))$$

The hypothesis space $\mathcal{H}$ is the space of functions $f$ under consideration. How complex should we allow functions $f$ to be? If the hypothesis space $\mathcal{H}$ is "too large", ERM will overfit. The function
$$\hat{f}(x) = \begin{cases} y_i & \text{if } x = x_i, \\ 0 & \text{otherwise} \end{cases}$$
will have zero empirical risk, but is useless for generalization, since it has simply "memorized" the dataset.
Supervised Learning Statistical Learning Theory
Training and Test Performance
Training error is the empirical risk
$$\frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i)).$$
For 0-1 loss in classification, this is the misclassification error on the training data $\{x_i, y_i\}_{i=1}^n$, which were used in learning $\hat{f}$.

Test error is the empirical risk on new, previously unseen observations $\{x_i, y_i\}_{i=1}^m$,
$$\frac{1}{m} \sum_{i=1}^m L(y_i, \hat{f}(x_i)),$$
which were NOT used in learning $\hat{f}$.

Test error is a much better gauge of how well the learned function generalizes to new data. The test error is in general larger than the training error.
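For example, reusing the iris LDA fit from above, a random train/test split gives both quantities (our sketch):

```r
# Training vs test misclassification error for LDA on iris.
library(MASS)
set.seed(1)
train <- sample(nrow(iris), 100)
fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris[train, ])

train_err <- mean(predict(fit)$class != iris$Species[train])
test_err  <- mean(predict(fit, iris[-train, ])$class != iris$Species[-train])
c(train = train_err, test = test_err)
```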
Supervised Learning Statistical Learning Theory
Hypothesis space for two-class LDA
Assume we have two classes $\{+1, -1\}$. Recall that the discriminant functions in LDA are linear. Assuming that data vectors in class $k$ are modelled as $\mathcal{N}(\mu_k, \Sigma)$, choosing class $+1$ over $-1$ involves:
$$a_{+1} + b_{+1}^\top x > a_{-1} + b_{-1}^\top x \iff a_\star + b_\star^\top x > 0,$$
where $a_\star = a_{+1} - a_{-1}$, $b_\star = b_{+1} - b_{-1}$.

Thus, the hypothesis space of two-class LDA consists of functions $f(x) = \text{sign}(a + b^\top x)$. We obtain coefficients $a$ and $b$, and thus the function $f$, through fitting the parameters of the generative model.

Discriminative learning: restrict $\mathcal{H}$ to a class of functions $f(x) = \text{sign}(a + b^\top x)$ and select $a$ and $b$ which minimize the empirical risk.
Supervised Learning Statistical Learning Theory
Space of linear decision functions
Hypothesis space $\mathcal{H} = \{f : f(x) = \text{sign}(a + b^\top x),\ a \in \mathbb{R},\ b \in \mathbb{R}^p\}$. Find $a, b$ that minimize the empirical risk under 0-1 loss:
$$\arg\min_{a,b} \frac{1}{n} \sum_{i=1}^n L(y_i, f_{a,b}(x_i)) = \arg\min_{a,b} \frac{1}{n} \sum_{i=1}^n \begin{cases} 0 & \text{if } y_i = \text{sign}(a + b^\top x_i) \\ 1 & \text{otherwise} \end{cases} = \arg\min_{a,b} \frac{1}{2n} \sum_{i=1}^n \left[1 - \text{sign}\left(y_i(a + b^\top x_i)\right)\right].$$

Combinatorial problem - not typically possible to solve... Maybe easier with a different loss function? (Logistic regression)
Supervised Learning Logistic Regression
Linearity of log-posterior odds
Another way to express the linear decision boundary of LDA:
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$
Solve explicitly for the conditional class probabilities:
$$p(Y = +1 \mid X = x) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x)$$
$$p(Y = -1 \mid X = x) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x)$$
where $s(\cdot)$ is the logistic function.

[Figure: the logistic function $s(z)$ for $z \in [-8, 8]$, rising from 0 to 1.]
Supervised Learning Logistic Regression
Logistic Regression
Consider maximizing the conditional log-likelihood:
$$\ell(a, b) = \sum_{i=1}^n \log p(Y = y_i \mid X = x_i) = \sum_{i=1}^n -\log\left(1 + \exp(-y_i(a + b^\top x_i))\right)$$
Equivalent to minimizing the empirical risk associated with the log loss:
$$\hat{R}^{\text{log}}_{\text{emp}} = \frac{1}{n} \sum_{i=1}^n \log\left(1 + \exp(-y_i(a + b^\top x_i))\right) = \frac{1}{n} \sum_{i=1}^n -\log \underbrace{s\left(y_i(a + b^\top x_i)\right)}_{p_{a,b}(y_i \mid x_i)}$$

[Figure: the log loss $\log(1 + \exp(-y(a + b^\top x)))$ and the 0-1 loss $\frac{1}{2} + \frac{1}{2}\,\text{sign}(-y(a + b^\top x))$ as functions of the margin.]
Supervised Learning Logistic Regression
Logistic Regression
Not possible to find optimal $a, b$ analytically. For simplicity, absorb $a$ as an entry in $b$ by appending '1' into the $x$ vector.

Objective function:
$$\hat{R}^{\text{log}}_{\text{emp}} = \frac{1}{n} \sum_{i=1}^n -\log s(y_i x_i^\top b)$$

Useful properties of the logistic function:
$$s(-z) = 1 - s(z), \qquad \nabla_z s(z) = s(z)s(-z), \qquad \nabla_z \log s(z) = s(-z), \qquad \nabla^2_z \log s(z) = -s(z)s(-z)$$

Differentiate with respect to $b$:
$$\nabla_b \hat{R}^{\text{log}}_{\text{emp}} = \frac{1}{n} \sum_{i=1}^n -s(-y_i x_i^\top b)\, y_i x_i$$
$$\nabla^2_b \hat{R}^{\text{log}}_{\text{emp}} = \frac{1}{n} \sum_{i=1}^n s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top$$
Supervised Learning Logistic Regression
Logistic Regression
The second derivative is positive-definite: the objective function is convex and there is a single unique global minimum. Many different algorithms can find the optimal $b$, e.g.:

Gradient descent:
$$b^{\text{new}} = b + \varepsilon \frac{1}{n} \sum_{i=1}^n s(-y_i x_i^\top b)\, y_i x_i$$

Stochastic gradient descent:
$$b^{\text{new}} = b + \varepsilon_t \frac{1}{|I(t)|} \sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i$$
where $I(t)$ is a subset of the data at iteration $t$, and $\varepsilon_t \to 0$ slowly ($\sum_t \varepsilon_t = \infty$, $\sum_t \varepsilon_t^2 < \infty$).

Newton-Raphson:
$$b^{\text{new}} = b - \left(\nabla^2_b \hat{R}^{\text{log}}_{\text{emp}}\right)^{-1} \nabla_b \hat{R}^{\text{log}}_{\text{emp}}$$
This is also called iteratively reweighted least squares.

Conjugate gradient, L-BFGS and other methods from numerical analysis.
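A minimal batch gradient descent for logistic regression, directly implementing the update above (our sketch; labels in {-1, +1}, fixed step size):

```r
sigm <- function(z) 1 / (1 + exp(-z))

logistic_gd <- function(X, y, eps = 0.5, iters = 2000) {
  X <- cbind(1, X)                # absorb the intercept a into b
  b <- rep(0, ncol(X))
  for (t in seq_len(iters)) {
    margins <- y * drop(X %*% b)
    # b_new = b + eps * (1/n) * sum_i s(-y_i x_i' b) y_i x_i
    b <- b + eps * colMeans(sigm(-margins) * y * X)
  }
  b
}

# Toy data: two Gaussian clouds.
set.seed(1)
X <- rbind(matrix(rnorm(100, -1), 50), matrix(rnorm(100, 1), 50))
y <- rep(c(-1, 1), each = 50)
b_hat <- logistic_gd(X, y)
mean(sign(drop(cbind(1, X) %*% b_hat)) == y)  # training accuracy
```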
Supervised Learning Logistic Regression
Logistic Regression vs. LDA
Both have linear decision boundaries and model the log-posterior odds as
$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$
LDA models the marginal density of $x$ as a Gaussian mixture with shared covariance
$$g(x) = \pi_{-1} \mathcal{N}(x; \mu_{-1}, \Sigma) + \pi_{+1} \mathcal{N}(x; \mu_{+1}, \Sigma)$$
and fits the parameters $\theta = (\mu_{-1}, \mu_{+1}, \pi_{-1}, \pi_{+1}, \Sigma)$ by maximizing the joint likelihood $\sum_{i=1}^n p(x_i, y_i \mid \theta)$. $a$ and $b$ are then determined from $\hat{\theta}$.

Logistic regression leaves the marginal density $g(x)$ as an arbitrary density function, and fits the parameters $a, b$ by maximizing the conditional likelihood $\sum_{i=1}^n p(y_i \mid x_i; a, b)$.
Supervised Learning Logistic Regression
Logistic Regression
Properties of logistic regression:
- Makes fewer modelling assumptions than generative classifiers.
- A simple example of a generalised linear model (GLM), for which much statistical theory exists:
  - assessment of fit via deviance and plots,
  - interpretation of entries of $b$ as odds-ratios,
  - fitting categorical data (sometimes called multinomial logistic regression),
  - well-founded approaches to removing insignificant features (drop-in-deviance test, Wald test).
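In R, logistic regression is fitted with glm and the binomial family; a quick sketch on simulated data (all values illustrative):

```r
# Logistic regression as a GLM: deviance, Wald tests and odds-ratios.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(1 + 2 * x))  # true intercept 1, slope 2
fit <- glm(y ~ x, family = binomial)
summary(fit)    # coefficients with Wald z-tests, null and residual deviance
exp(coef(fit))  # entries of b interpreted as odds-ratios
```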
Supervised Learning Model Complexity and Generalization
Model Complexity and Generalization
Supervised Learning Model Complexity and Generalization
Generalization
Generalization ability: what is the out-of-sample error of the learner $\hat{f}$?
- training error $\neq$ testing error
- We learn $\hat{f}$ by minimizing some variant of the empirical risk $\hat{R}_{\text{emp}}(\hat{f})$ - what can we say about the true risk $R(\hat{f})$?
- Two important factors determining generalization ability:
  - Model complexity
  - Training data size
Supervised Learning Model Complexity and Generalization
Learning Curves
[Figure: prediction error vs model complexity/flexibility, showing the training error and test error curves, with regimes labelled Underfit, Just right and Overfit.]
Fixed dataset size, varying model complexity.
Supervised Learning Model Complexity and Generalization
Learning Curves
[Figures: two panels of prediction error vs training dataset size, each showing training error and testing error curves and an overfit region.]
Fixed model complexity, varying dataset size. Two models: one simple, one complex. Which is which?
Supervised Learning Model Complexity and Generalization
Bias-Variance Tradeoff
Where does the prediction error come from?

Example: squared loss in regression: $\mathcal{X} = \mathbb{R}^p$, $\mathcal{Y} = \mathbb{R}$,
$$L(Y, f(X)) = (Y - f(X))^2.$$
The optimal $f$ is the conditional mean:
$$f_*(x) = \mathbb{E}\left[Y \mid X = x\right].$$
This follows from $R(f) = \mathbb{E}_X \mathbb{E}\left[(Y - f(X))^2 \mid X\right]$ and
$$\mathbb{E}\left[(Y - f(X))^2 \mid X = x\right] = \mathbb{E}\left[Y^2 \mid X = x\right] - 2f(x)\,\mathbb{E}\left[Y \mid X = x\right] + f(x)^2 = \text{Var}\left[Y \mid X = x\right] + \left(\mathbb{E}\left[Y \mid X = x\right] - f(x)\right)^2.$$
Supervised Learning Model Complexity and Generalization
Bias-Variance Tradeoff
The optimal risk is the intrinsic conditional variability of $Y$ (noise):
$$R(f_*) = \mathbb{E}_X\left[\text{Var}\left[Y \mid X\right]\right].$$
Excess risk:
$$R(f) - R(f_*) = \mathbb{E}_X\left[(f(X) - f_*(X))^2\right].$$
How does the excess risk behave on average? Consider a randomly selected dataset $\mathcal{D} = \{(X_i, Y_i)\}_{i=1}^n$ and $f^{(\mathcal{D})}$ trained on $\mathcal{D}$ using a model (hypothesis class) $\mathcal{H}$:
$$\mathbb{E}_{\mathcal{D}}\left[R(f^{(\mathcal{D})}) - R(f_*)\right] = \mathbb{E}_{\mathcal{D}} \mathbb{E}_X\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right] = \mathbb{E}_X \mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right].$$
Supervised Learning Model Complexity and Generalization
Bias-Variance Tradeoff
Denote $\bar{f}(x) = \mathbb{E}_{\mathcal{D}} f^{(\mathcal{D})}(x)$ (the average decision function over all possible datasets). Then
$$\mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - f_*(X)\right)^2\right] = \underbrace{\mathbb{E}_{\mathcal{D}}\left[\left(f^{(\mathcal{D})}(X) - \bar{f}(X)\right)^2\right]}_{\text{Var}_X(\mathcal{H}, n)} + \underbrace{\left(\bar{f}(X) - f_*(X)\right)^2}_{\text{Bias}^2_X(\mathcal{H}, n)}.$$
Now,
$$\mathbb{E}_{\mathcal{D}} R(f^{(\mathcal{D})}) = R(f_*) + \mathbb{E}_X \text{Var}_X(\mathcal{H}, n) + \mathbb{E}_X \text{Bias}^2_X(\mathcal{H}, n).$$

Where does the prediction error come from?
- Noise: intrinsic difficulty of the regression problem.
- Bias: how far away is the best learner in the model (the average learner over all possible datasets) from the optimal one?
- Variance: how variable is our learning method if given different datasets?
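These three terms can be seen in a small simulation (our sketch): fit polynomials of increasing degree to many datasets drawn from the same distribution and inspect the fits at a fixed point $x_0$:

```r
# Bias-variance at a single point x0 for polynomial regression.
set.seed(1)
f_star <- function(x) sin(2 * pi * x)   # true regression function
x0 <- 0.3
fit_at_x0 <- function(degree, n_datasets = 200, n = 30) {
  replicate(n_datasets, {
    x <- runif(n); y <- f_star(x) + rnorm(n, sd = 0.3)
    predict(lm(y ~ poly(x, degree)), newdata = data.frame(x = x0))
  })
}
for (d in c(1, 3, 9)) {
  p <- fit_at_x0(d)
  cat(sprintf("degree %d: bias^2 = %.4f, variance = %.4f\n",
              d, (mean(p) - f_star(x0))^2, var(p)))
}
```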
Supervised Learning Model Complexity and Generalization
Learning Curves
[Figures: the same prediction error vs training dataset size curves, now annotated with the bias and variance contributions and the overfit region for each of the two models.]
Supervised Learning Model Complexity and Generalization
Learning Curve
[Figure: prediction error vs model complexity/flexibility with training and test error curves; underfit regime: high bias, low variance; overfit regime: low bias, high variance; "just right" in between.]
Supervised Learning Artificial Neural Networks
Artificial Neural Networks
Supervised Learning Artificial Neural Networks
Biological inspiration
Basic computational elements:neurons.Receives signals from otherneurons via dendrites.Sends processed signals viaaxons.Axon-dendrite interactions atsynapses.1010 − 1011 neurons.1014 − 1015 synapses.
Supervised Learning Artificial Neural Networks
Single Neuron Classifier
[Diagram: a single neuron with inputs $x^{(1)}, x^{(2)}, x^{(3)}, \dots, x^{(p)}$, weights $w_1, \dots, w_p$, bias $b$, a summation node and transfer function $s(\cdot)$, producing $P(Y = 1 \mid X = x)$ from the activation $w^\top x + b$.]

- activation $w^\top x + b$ (linear in inputs $x$)
- activation/transfer function $s$ gives the output/activity (potentially nonlinear in $x$)
- common nonlinear activation function $s(a) = \frac{1}{1 + e^{-a}}$: logistic regression
- learn $w$ and $b$ via gradient descent
Supervised Learning Artificial Neural Networks
Single Neuron Classifier
[Figure: single neuron classifier fit on two-dimensional data ($x_{i1}$, $x_{i2}$).]
Supervised Learning Artificial Neural Networks
Overfitting
[Figures: network fits shown at gradient descent iterations 30, 80, 500, 3000, 10000 and 40000, progressively overfitting the training data.]

Figures from D. MacKay, Information Theory, Inference and Learning Algorithms

Prevent overfitting by:
- early stopping: just halt the gradient descent
- regularization: L2-regularization, called weight decay in the neural networks literature.
Supervised Learning Artificial Neural Networks
Multilayer Networks
Data vectors $x_i \in \mathbb{R}^p$, binary labels $y_i \in \{0, 1\}$.
- inputs $x_{i1}, \dots, x_{ip}$
- output $\hat{y}_i = P(Y = 1 \mid X = x_i)$
- hidden unit activities $h_{i1}, \dots, h_{im}$

Compute hidden unit activities:
$$h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right)$$
Compute output probability:
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right)$$

[Diagram: network with inputs $x_{i1}, \dots, x_{i4}$, hidden units $h_{i1}, \dots, h_{i7}$, and output $\hat{y}_i$.]
Supervised Learning Artificial Neural Networks
Multilayer Networks
[Figure: multilayer network fit on two-dimensional data ($x_{i1}$, $x_{i2}$).]
Supervised Learning Artificial Neural Networks
Training a Neural Network
Objective function: L2-regularized log-loss
$$J = -\sum_{i=1}^n \left[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right] + \frac{\lambda}{2}\left(\sum_{jl} (w^h_{jl})^2 + \sum_l (w^o_l)^2\right)$$
where
$$\hat{y}_i = s\left(b^o + \sum_{l=1}^m w^o_l h_{il}\right), \qquad h_{il} = s\left(b^h_l + \sum_{j=1}^p w^h_{jl} x_{ij}\right).$$
Optimize parameters $\theta = \{b^h, w^h, b^o, w^o\}$, where $b^h \in \mathbb{R}^m$, $w^h \in \mathbb{R}^{p \times m}$, $b^o \in \mathbb{R}$, $w^o \in \mathbb{R}^m$, with gradient descent:
$$\frac{\partial J}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial w^o_l} = \lambda w^o_l + \sum_{i=1}^n (\hat{y}_i - y_i) h_{il},$$
$$\frac{\partial J}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n \frac{\partial J}{\partial \hat{y}_i} \frac{\partial \hat{y}_i}{\partial h_{il}} \frac{\partial h_{il}}{\partial w^h_{jl}} = \lambda w^h_{jl} + \sum_{i=1}^n (\hat{y}_i - y_i)\, w^o_l\, h_{il}(1 - h_{il})\, x_{ij}.$$
L2-regularization is often called weight decay. Multiple hidden layers: backpropagation algorithm.
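These gradients translate directly into code. A from-scratch sketch in R (our notation; biases omitted for brevity, gradient step scaled by 1/n):

```r
sigm <- function(z) 1 / (1 + exp(-z))

# Single-hidden-layer network trained by gradient descent on the
# L2-regularized log-loss; X is n x p, y in {0, 1}, m hidden units.
train_nn <- function(X, y, m = 5, lambda = 0.01, eps = 0.5, iters = 2000) {
  p <- ncol(X); n <- nrow(X)
  Wh <- matrix(rnorm(p * m, sd = 0.1), p, m)  # input-to-hidden weights w^h
  wo <- rnorm(m, sd = 0.1)                    # hidden-to-output weights w^o
  for (t in seq_len(iters)) {
    H     <- sigm(X %*% Wh)          # hidden activities h_il
    yhat  <- sigm(drop(H %*% wo))    # output probabilities
    delta <- yhat - y                # (yhat_i - y_i) from the gradients above
    grad_wo <- lambda * wo + drop(t(H) %*% delta)
    grad_Wh <- lambda * Wh +
      t(X) %*% (delta * sweep(H * (1 - H), 2, wo, "*"))
    wo <- wo - eps * grad_wo / n
    Wh <- Wh - eps * grad_Wh / n
  }
  list(Wh = Wh, wo = wo)
}
```

Usage would be train_nn(X, (y + 1) / 2) on the toy two-cloud data above, after recoding the labels to {0, 1}.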
Supervised Learning Artificial Neural Networks
Multiple hidden layers
$$\hat{y}_i = h^{L+1}_i, \qquad h^{\ell+1}_i = s\left(W^{\ell+1} h^\ell_i\right), \qquad x_{i1} = h^0_{i1}, \dots, x_{ip} = h^0_{ip}$$
- $W^{\ell+1} = (w^{\ell+1}_{jk})_{jk}$: weight matrix at the $(\ell+1)$-th layer, with weight $w^\ell_{jk}$ on the edge between $h^{\ell-1}_{ik}$ and $h^\ell_{ij}$
- $s$: entrywise (logistic) transfer function

$$\hat{y}_i = s\left(W^{L+1} s\left(W^L \left(\cdots s\left(W^1 x_i\right)\right)\right)\right)$$

[Diagram: deep network with input layer $h^0_i = x_i$, hidden layers $h^1_i, \dots, h^L_i$ of $m$ units each, and output $\hat{y}_i = h^{L+1}_i$.]
Supervised Learning Artificial Neural Networks
Backpropagation
[Diagram: unit $h^\ell_{ij}$ receives input from $h^{\ell-1}_{ik}$ through weight $w^\ell_{jk}$ and feeds into units $h^{\ell+1}_{i1}, \dots, h^{\ell+1}_{im}$.]

$$J = -\sum_{i=1}^n \left[y_i \log h^{L+1}_i + (1 - y_i) \log\left(1 - h^{L+1}_i\right)\right]$$

Gradients with respect to $h^\ell_{ij}$ are computed by recursive application of the chain rule, and propagated through the network backwards:
$$\frac{\partial J}{\partial h^{L+1}_i} = -\frac{y_i}{h^{L+1}_i} + \frac{1 - y_i}{1 - h^{L+1}_i},$$
$$\frac{\partial J}{\partial h^\ell_{ij}} = \sum_{r=1}^m \frac{\partial J}{\partial h^{\ell+1}_{ir}} \frac{\partial h^{\ell+1}_{ir}}{\partial h^\ell_{ij}}, \qquad \frac{\partial J}{\partial w^\ell_{jk}} = \sum_{i=1}^n \frac{\partial J}{\partial h^\ell_{ij}} \frac{\partial h^\ell_{ij}}{\partial w^\ell_{jk}}.$$
Supervised Learning Artificial Neural Networks
Neural Networks
[Figures: a two-dimensional classification example on the unit square (axes $x_1$, $x_2$): the objective's global solution (global minimum) and local minima 1-3, and a neural network fit with a weight decay of 0.01.]
R package implementing neural networks with a single hidden layer: nnet.
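For instance (a sketch on simulated data; the settings mirror the figure but are otherwise arbitrary):

```r
# nnet: single hidden layer with 7 units and weight decay 0.01.
library(nnet)
set.seed(1)
x1 <- runif(200); x2 <- runif(200)
y  <- factor(as.numeric(x1^2 + x2^2 > 0.8))   # nonlinear class boundary
fit <- nnet(y ~ x1 + x2, size = 7, decay = 0.01, maxit = 500)
pred <- predict(fit, data.frame(x1, x2), type = "class")
mean(pred == y)  # training accuracy
```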
Supervised Learning Artificial Neural Networks
Neural Networks – Discussion
- Nonlinear hidden units introduce modelling flexibility.
- In contrast to user-introduced nonlinearities, features are global, and can be learned to maximize predictive performance.
- Neural networks with a single hidden layer and sufficiently many hidden units can model arbitrarily complex functions.
- The optimization problem is not convex, and the objective function can have many local optima, plateaus and ridges.
- On large scale problems, one often uses stochastic gradient descent, along with a whole host of techniques for optimization, regularization, and initialization.
- Recent developments, especially by Geoffrey Hinton, Yann LeCun, Yoshua Bengio, Andrew Ng and others. See also http://deeplearning.net/.
Supervised Learning Artificial Neural Networks
Dropout Training of Neural Networks
Neural network with a single layer of hidden units:
- Hidden unit activations: $h_{ik} = s\left(b^h_k + \sum_{j=1}^p W^h_{jk} x_{ij}\right)$
- Output probability: $\hat{y}_i = s\left(b^o + \sum_{k=1}^m W^o_k h_{ik}\right)$

Large, overfitted networks often have co-adapted hidden units. What each hidden unit learns may in fact be useless, e.g. predicting the negation of predictions from other units. Can prevent co-adaptation by randomly dropping out units from the network.

[Diagram: network with inputs $x_{i1}, \dots, x_{i4}$, hidden units $h_{i1}, \dots, h_{i7}$, and output $\hat{y}_i$.]

Hinton et al (2012).
Supervised Learning Artificial Neural Networks
Dropout Training of Neural Networks
Model as an ensemble of networks:
$$p(y_i = 1 \mid x_i, \theta) = \sum_{b \subset \{1,\dots,m\}} q^{|b|} (1 - q)^{m - |b|}\; p(y_i = 1 \mid x_i, \theta, \text{drop out units } b)$$

[Diagrams: four sampled subnetworks, each retaining a different subset of the hidden units $h_{i1}, \dots, h_{i7}$.]

- Weight-sharing among all networks: each network uses a subset of the parameters of the full network (corresponding to the retained units).
- Training by stochastic gradient descent: at each iteration a network is sampled from the ensemble, and its subset of parameters is updated.
- Biological inspiration: $10^{14}$ weights to be fitted in a lifetime of $10^9$ seconds; Poisson spikes as a regularization mechanism which prevents co-adaptation: Geoff Hinton on Brains, Sex and Machine Learning.
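As a toy illustration of the training-time sampling (our sketch; here $q$ is the per-unit dropout probability, as in the ensemble sum above):

```r
# One stochastic forward pass: drop each hidden unit with probability q.
sigm <- function(z) 1 / (1 + exp(-z))
dropout_forward <- function(x, Wh, wo, q = 0.5) {
  h    <- sigm(drop(x %*% Wh))           # hidden activities (Wh is p x m)
  keep <- rbinom(length(h), 1, 1 - q)    # sampled subnetwork b
  sigm(sum(wo * h * keep))               # output of the sampled network
}
# At test time, one forward pass of the full network with hidden
# activities scaled by (1 - q) approximates the ensemble average.
```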
Supervised Learning Artificial Neural Networks
Dropout Training of Neural Networks

Classification of phonemes in speech.
Figure from Hinton et al.