
Concrete Dropout

Yarin Gal 1,2,3, Jiri Hron 1, Alex Kendall 1

1: Department of Engineering, University of Cambridge, UK; 2: Alan Turing Institute, UK; 3: Department of Computer Science, University of Oxford, UK

Motivation

• Dropout probabilities have a significant effect on predictive performance

• Traditional grid search or manual tuning is prohibitively expensive for large models

• Optimising w.r.t. a sensible objective should result in better-calibrated uncertainty and a shorter experiment cycle

• Useful for large modern models in machine vision and reinforcement learning

Background

• Gal and Ghahramani (2015) reinterpreted dropout regularisation as approximate inference in Bayesian neural networks (BNNs)

• Dropout probabilities $p_k$ are variational parameters of the approximate posterior $q_\theta(\omega) = \prod_k q_{M_k, p_k}(W_k)$, where $W_k = M_k \cdot \mathrm{diag}(z_k)$ and $z_{kl} \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(1 - p_k)$

• The Concrete distribution (Maddison et al., Jang et al.) relaxes the Categorical distribution to obtain gradients w.r.t. the probability vector

– Example: $z_{kl} \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(1 - p_k)$ is replaced by $\tilde{z}_{kl} = \mathrm{sigmoid}\!\left(\left(\log \tfrac{p_k}{1 - p_k} + \log \tfrac{u_{kl}}{1 - u_{kl}}\right) / t\right)$, where $u_{kl} \overset{\text{iid}}{\sim} \mathrm{Uniform}(0, 1)$ (see the NumPy sketch below)
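
A minimal NumPy sketch of the two bullets above (illustrative only, not the authors' code; the layer sizes, dropout probability and temperature are made-up values): it samples $W_k = M_k \cdot \mathrm{diag}(z_k)$ with a hard Bernoulli mask, then replaces the mask with its Concrete relaxation, which approaches the hard $\{0, 1\}$ values as the temperature $t$ decreases.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (not from the poster)
K_in, K_out = 4, 3          # layer dimensions
p = 0.3                     # dropout probability p_k
t = 0.1                     # Concrete relaxation temperature

M = rng.normal(size=(K_out, K_in))        # variational mean weights M_k

# Hard dropout: z ~ Bernoulli(1 - p), W = M · diag(z)
z_hard = rng.binomial(1, 1.0 - p, size=K_in)
W_hard = M @ np.diag(z_hard)

# Concrete relaxation of the drop indicator:
# z~ = sigmoid((log(p/(1-p)) + log(u/(1-u))) / t), u ~ Uniform(0, 1)
u = rng.uniform(size=K_in)
z_tilde = 1.0 / (1.0 + np.exp(-(np.log(p / (1 - p)) + np.log(u / (1 - u))) / t))
W_soft = M @ np.diag(1.0 - z_tilde)       # keep mask is 1 - z~, as in the Keras snippet below

print(np.round(z_tilde, 3))               # near {0, 1} for small t; smoother for larger t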

Learning dropout probabilities

Stochastic variational inference (SVI; Hoffman et al., 2013) can be used to approximate the posterior:

$\hat{\mathcal{L}}_{\mathrm{MC}}(\theta) = -\frac{1}{M} \sum_{i \in S} \log p\!\left(y_i \mid f^{\omega_\theta}(x_i)\right) + \frac{1}{N}\, \mathrm{KL}\!\left(q_\theta(\omega) \,\|\, p(\omega)\right)$

where $S$ is a random minibatch of $M$ of the $N$ data points and $\omega_\theta$ is a sample from $q_\theta(\omega)$.
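
As a minimal sketch of this objective (illustrative Python; the function and argument names are not from the poster), the per-minibatch estimate is the average negative log-likelihood plus the KL term down-weighted by $1/N$:

import numpy as np

def mc_objective(nll_per_example, kl_q_p, N):
    # nll_per_example: -log p(y_i | f(x_i)) for each point in the minibatch S (size M),
    # evaluated with one dropout sample omega ~ q_theta(omega)
    # kl_q_p: KL(q_theta(omega) || p(omega)); N: total number of training points
    return np.mean(nll_per_example) + kl_q_p / N

In the Keras snippet below, the $1/N$ factor and the prior length-scale are typically folded into the fixed weight_regularizer and dropout_regularizer constants.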

The structure of $q_\theta(\omega)$ turns the KL into a sum over layers of terms of the form:

$\mathrm{KL}\!\left(q_{M_k, p_k}(W_k) \,\|\, p(W_k)\right) \propto \frac{l^2 (1 - p_k)}{2} \|M_k\|_F^2 - K_{k+1}\, \mathcal{H}(p_k)$

$\mathcal{H}(p_k) := -p_k \log p_k - (1 - p_k) \log (1 - p_k)$
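
To make the trade-off explicit, a small NumPy sketch (values and names are illustrative; $l$ denotes the prior length-scale and K the number of units the dropout mask is applied to, i.e. input_dim in the Keras snippet below) that evaluates the per-layer term:

import numpy as np

def entropy(p):
    # H(p) = -p log p - (1 - p) log(1 - p)
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

def layer_kl_term(M, p, length_scale, K):
    # proportional to  l^2 (1 - p) / 2 * ||M||_F^2  -  K * H(p)
    weight_term = (length_scale ** 2) * (1.0 - p) / 2.0 * np.sum(M ** 2)
    dropout_term = K * entropy(p)
    return weight_term - dropout_term

M = np.random.randn(3, 4)                 # illustrative weight matrix M_k
print(layer_kl_term(M, p=0.2, length_scale=1.0, K=4))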

[Figure: learned dropout probability (Layers #1–#3), epistemic uncertainty (std), aleatoric uncertainty (std), and predictive uncertainty (std), each plotted against the number of data points N.]

Properties:

• For large $K_{k+1}$, $\mathcal{H}(p_k)$ pushes $p_k \to 0.5$, maximising entropy

• A large $\|M_k\|_F^2$ forces $p_k$ towards $1$, i.e. to drop all weights

• As $N \to \infty$, the $\tfrac{1}{N}$ KL term vanishes and the approximate posterior concentrates at the MLE

Simple to implement in Keras (Chollet et al., 2015):

# regularisation: adds a per-layer regulariser of the form above to the loss
# (self.weight_regularizer and self.dropout_regularizer are fixed constants that
# absorb the prior length-scale and the 1/N factor of the objective;
# weight is the layer's kernel M_k, input_dim the size of the dropout mask)
...
kernel_regularizer = self.weight_regularizer * K.sum(K.square(weight))
dropout_regularizer = self.p * K.log(self.p) + (1. - self.p) * K.log(1. - self.p)  # = -H(p)
dropout_regularizer *= self.dropout_regularizer * input_dim                        # scaled -K H(p) term
regularizer = K.sum(kernel_regularizer + dropout_regularizer)
self.add_loss(regularizer)
...

# forward pass: Concrete relaxation of the dropout mask
...
u = K.random_uniform(shape=K.shape(x))                   # u ~ Uniform(0, 1)
z = K.log(self.p / (1. - self.p)) + K.log(u / (1. - u))
z = K.sigmoid(z / temp)                                  # temp: relaxation temperature t (e.g. 0.1)
x *= 1. - z                                              # apply the relaxed keep mask
...                                                      # (a full layer typically also rescales x by 1 / (1 - p))
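
The snippet treats self.p as a quantity that can be optimised by gradient descent. One common way to achieve this, shown here as a sketch rather than the authors' exact code, is to store an unconstrained logit as a trainable weight in the layer's build() and squash it through a sigmoid (the initialiser range below is an arbitrary choice; initializers refers to keras.initializers):

# inside the layer's build()
self.p_logit = self.add_weight(name='p_logit', shape=(1,),
                               initializer=initializers.RandomUniform(minval=-2.0, maxval=0.0),
                               trainable=True)
self.p = K.sigmoid(self.p_logit[0])    # keeps p in (0, 1) while the logit is optimised freely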

Application to image segmentation

Epistemic and aleatoric uncertainty in machine vision:

Image segmentation using Bayesian SegNet (Kendall et al., 2015)

Converged probabilities were robust to random initialisation:

[Figure: Concrete dropout probability p against training iterations.]

And compare favourably to an expensively hand-tuned setting:

DenseNet Model Variant               MC Sampling   IoU (%)
No Dropout                           -             65.8
Dropout (manually-tuned p = 0.2)     ✗             67.1
Dropout (manually-tuned p = 0.2)     ✓             67.2
Concrete Dropout                     ✗             67.2
Concrete Dropout                     ✓             67.4

Comparing the performance against baseline models with DenseNet on the CamVid road scene semantic segmentation dataset

[Figure: uncertainty calibration plot of observed frequency against predicted probability. Calibration MSE: No Dropout 0.0401, Dropout 0.0316, Concrete Dropout 0.0296.]

Reduced uncertainty calibration RMSE
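
For completeness, a small sketch (illustrative only; the binning scheme and names are assumptions, not the authors' evaluation code) of how such a calibration curve and its error can be computed from predicted probabilities and binary correctness labels:

import numpy as np

def calibration_error(probs, labels, n_bins=10):
    # probs: predicted class probabilities in [0, 1]; labels: 1 if that class was correct, else 0
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    predicted, observed = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            predicted.append(probs[mask].mean())     # mean predicted probability in the bin
            observed.append(labels[mask].mean())     # empirical frequency of correctness
    predicted, observed = np.array(predicted), np.array(observed)
    rmse = np.sqrt(np.mean((predicted - observed) ** 2))
    return predicted, observed, rmse

# toy usage: perfectly calibrated synthetic predictions give RMSE close to 0
rng = np.random.default_rng(0)
p = rng.uniform(size=10000)
y = (rng.uniform(size=10000) < p).astype(float)
print(calibration_error(p, y)[2])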

Conclusion and future research

• Automatic tuning of dropout probabilities, even for very large models

• Better-calibrated uncertainty estimates

• RL: epistemic uncertainty vanishes as more data is acquired