
Concrete Dropout

Yarin Gal (1,2,3), Jiri Hron (1), Alex Kendall (1)

1: Department of Engineering, University of Cambridge, UK; 2: Alan Turing Institute, UK; 3: Department of Computer Science, University of Oxford, UK

Motivation

• Dropout probabilities have a significant effect on predictive performance

• Traditional grid search or manual tuning is prohibitively expensive for large models

• Optimisation wrt a sensible objective should result in better calibrated uncertainty and a shorter experiment cycle

• Useful for large modern models in machine vision and reinforcement learning

Background

• Gal and Ghahramani (2015) reinterpreted dropout regularisation as approximate inference in Bayesian neural networks (BNNs)

• Dropout probabilities $p_k$ are variational parameters of the approximate posterior $q_\theta(\omega) = \prod_k q_{M_k, p_k}(W_k)$, where $W_k = M_k \cdot \mathrm{diag}(z_k)$ and $z_{k,j} \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(1 - p_k)$

• The Concrete distribution (Maddison et al., Jang et al.) relaxes the Categorical distribution to obtain gradients wrt the probability vector

  – Example: $z_{k,j} \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(1 - p_k)$ is replaced by $\tilde{z}_{k,j} = \mathrm{sigmoid}\big(\big(\log \tfrac{p_k}{1 - p_k} + \log \tfrac{u_{k,j}}{1 - u_{k,j}}\big) / t\big)$, where $u_{k,j} \overset{\text{iid}}{\sim} \mathrm{Uniform}(0, 1)$ and $t$ is the temperature (see the numerical sketch below)
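A tiny NumPy sketch (not from the poster) of this relaxation: at a low temperature the relaxed samples sit close to 0 or 1 and their mean approaches the drop probability; the chosen p, t and variable names are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
p, t = 0.2, 0.1                       # drop probability and temperature (illustrative)
u = rng.uniform(size=100_000)         # u ~ Uniform(0, 1)

# relaxed drop mask: sigmoid((log(p/(1-p)) + log(u/(1-u))) / t)
z_tilde = 1. / (1. + np.exp(-(np.log(p / (1. - p)) + np.log(u / (1. - u))) / t))

print(z_tilde.mean())                                # close to p = 0.2
print(((z_tilde > 0.01) & (z_tilde < 0.99)).mean())  # small fraction: most samples are near 0 or 1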

Learning dropout probabilities

SVI (Hoffman et al., 2013) can be used to approximate the posterior:

$$\hat{\mathcal{L}}_{\mathrm{MC}}(\theta) = -\frac{1}{M} \sum_{i \in S} \log p\big(y_i \mid f^{\omega_\theta}(x_i)\big) + \frac{1}{N}\,\mathrm{KL}\big(q_\theta(\omega)\,\|\,p(\omega)\big)$$

where $S$ is a random minibatch of size $M$ drawn from the $N$ training points.

Structure of $q_\theta(\omega)$ turns calculation of the KL into a sum over the layers:

$$\mathrm{KL}\big(q_{M_k, p_k}(W_k)\,\|\,p(W_k)\big) \propto \frac{l^2 (1 - p_k)}{2}\,\|M_k\|_F^2 - K_{k+1}\,\mathcal{H}(p_k)$$

$$\mathcal{H}(p_k) := -p_k \log p_k - (1 - p_k) \log(1 - p_k)$$

with $l$ the prior length-scale.
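For concreteness, a minimal NumPy sketch (not from the poster) of how the per-layer KL term and the Monte Carlo objective above fit together; the function names, the length_scale argument, and the convention that the second kernel dimension plays the role of $K_{k+1}$ are assumptions for illustration.

import numpy as np

def entropy(p):
    # H(p) = -p log p - (1 - p) log(1 - p)
    return -p * np.log(p) - (1. - p) * np.log(1. - p)

def layer_kl(M, p, length_scale=1.0):
    # KL(q_{M,p}(W) || p(W)) up to an additive constant:
    #   l^2 (1 - p) / 2 * ||M||_F^2  -  K_{k+1} * H(p)
    K_next = M.shape[1]   # assumed to correspond to K_{k+1}
    return length_scale**2 * (1. - p) / 2. * np.sum(M**2) - K_next * entropy(p)

def mc_objective(minibatch_nll, M_size, layers, N):
    # minibatch_nll: sum over the minibatch of -log p(y_i | f(x_i))
    # layers: list of (M_k, p_k) pairs; N: total number of training points
    return minibatch_nll / M_size + sum(layer_kl(M, p) for M, p in layers) / N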

[Figure: synthetic-data experiment. Panels show, as a function of the number of data points N (10^1 to 10^4): the converged dropout probability for Layers #1–#3, the epistemic uncertainty (std), the aleatoric uncertainty (std), and the predictive uncertainty (std).]

Properties:

• For large $K_{k+1}$, the entropy term $\mathcal{H}(p_k)$ pushes $p_k \to 0.5$, maximising entropy

• Large $\|M_k\|_F^2$ forces $p_k$ towards 1, i.e. to drop all weights (both effects are made precise in the worked minimisation below)

• As $N \to \infty$, the KL term is ignored and the posterior concentrates at the MLE
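The first two properties follow from minimising the KL contribution alone with respect to $p_k$, holding $M_k$ fixed and ignoring the data term (a worked step added here, not on the poster):

$$\frac{\partial}{\partial p_k}\left[\frac{l^2 (1 - p_k)}{2}\,\|M_k\|_F^2 - K_{k+1}\,\mathcal{H}(p_k)\right] = -\frac{l^2 \|M_k\|_F^2}{2} - K_{k+1} \log\frac{1 - p_k}{p_k} = 0$$

$$\Rightarrow\; p_k^{*} = \mathrm{sigmoid}\!\left(\frac{l^2 \|M_k\|_F^2}{2\,K_{k+1}}\right),$$

which tends to 0.5 as $K_{k+1}$ grows and towards 1 as $\|M_k\|_F^2$ grows.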

Simple to implement in Keras (Chollet et al., 2015):

# regularisation

...

kernel_regularizer = self.weight_regularizer * K.sum(K.square(weight))  # squared-weight part of the KL

dropout_regularizer = self.p * K.log(self.p) + (1.-self.p) * K.log(1.-self.p)  # equals -H(p)

dropout_regularizer *= self.dropout_regularizer * input_dim  # scaled by the number of input units

regularizer = K.sum(kernel_regularizer + dropout_regularizer)

self.add_loss(regularizer)

...

# forward pass

...

u = K.random_uniform(shape=K.shape(x))  # u ~ Uniform(0, 1)

z = K.log(self.p / (1. - self.p)) + K.log(u / (1-u))

z = K.sigmoid(z / temp)  # relaxed (Concrete) drop mask

x *= 1. - z  # units are kept with probability roughly 1 - p

...
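In practice one would typically guard the logarithms against p or u hitting 0 or 1, and rescale the output so its expected magnitude matches standard (inverted) dropout; a minimal sketch of such a variant (an implementation detail assumed here, not shown on the poster):

eps = K.epsilon()
u = K.random_uniform(shape=K.shape(x))
z = K.log(self.p + eps) - K.log(1. - self.p + eps) \
    + K.log(u + eps) - K.log(1. - u + eps)
z = K.sigmoid(z / temp)
x *= 1. - z
x /= 1. - self.p  # inverted-dropout rescaling keeps the expected activation unchanged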

Application to image segmentation

Epistemic and aleatoric uncertainty in machine vision:

Image segmentation using Bayesian SegNet (Kendall et al., 2015)

Converged probabilities were robust to random initialisation:

[Figure: five panels of concrete dropout probability p against training iterations (0 to 40,000; y-axis 0.0 to 0.5), showing the probabilities converging to the same values from different random initialisations.]

They also compare favourably to an expensively hand-tuned setting:

DenseNet Model Variant              MC Sampling   IoU
No Dropout                          -             65.8
Dropout (manually-tuned p = 0.2)    ✗             67.1
Dropout (manually-tuned p = 0.2)    ✓             67.2
Concrete Dropout                    ✗             67.2
Concrete Dropout                    ✓             67.4

Comparing the performance against baseline models with DenseNet on the CamVid road scene semantic segmentation dataset.
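"MC Sampling" in the table averages multiple stochastic forward passes at test time; a minimal sketch of that procedure (stochastic_predict is a hypothetical stand-in for one forward pass of the model with dropout kept active):

import numpy as np

def mc_predict(stochastic_predict, x, T=50):
    # T forward passes, each with freshly sampled dropout masks
    samples = np.stack([stochastic_predict(x) for _ in range(T)])
    mean = samples.mean(axis=0)           # MC-averaged prediction
    epistemic_std = samples.std(axis=0)   # spread over dropout masks (epistemic uncertainty)
    return mean, epistemic_std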

[Figure: uncertainty calibration plot, frequency against probability (both 0.0 to 1.0), with calibration MSE of 0.0401 for No Dropout, 0.0316 for Dropout, and 0.0296 for Concrete Dropout.]

Reduced uncertainty calibration RMSE

Conclusion and future research

• Automatic tuning of dropout probabilities, even for very large models

• Better calibrated uncertainty estimates

• RL: epistemic uncertainty will vanish as more data is acquired
