Concrete Dropout
Yarin Gal 1,2,3, Jiri Hron 1, Alex Kendall 1
1: Department of Engineering, University of Cambridge, UK; 2: Alan Turing Institute, UK; 3: Department of Computer Science, University of Oxford, UK
Motivation
• Dropout probabilities have a significant effect on predictive performance
• Traditional grid search or manual tuning is prohibitively expensive for large models
• Optimisation wrt a sensible objective should result in better calibrated uncertainty and shorter experiment cycles
• Useful for large modern models in machine vision and reinforcement learning
Background
• Gal and Ghahramani (2015) reinterpreted dropout regularisation as approximate inference in BNNs
• Dropout probabilities p_k are variational parameters of the approximate posterior q_θ(ω) = ∏_k q_{M_k, p_k}(W_k), where W_k = M_k · diag(z_k) and z_{k,l} ∼ Bernoulli(1 − p_k) i.i.d.
• The Concrete distribution (Maddison et al., Jang et al.) relaxes the Categorical distribution to obtain gradients wrt the probability vector
  – Example: z_{k,l} ∼ Bernoulli(1 − p_k) is replaced by z̃_{k,l} = sigmoid((log(p_k / (1 − p_k)) + log(u_{k,l} / (1 − u_{k,l}))) / t), where u_{k,l} ∼ Uniform(0, 1) i.i.d.
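As a sanity check, the relaxation can be sampled directly; the numpy sketch below is illustrative (not the authors' code; `concrete_mask` and the temperature value are my own choices). At low temperature the relaxed mask is near-binary and drops a unit (z̃ ≈ 1) with probability ≈ p:

```python
import numpy as np

def concrete_mask(p, t, shape, rng):
    """Concrete relaxation of a Bernoulli(p) drop mask: differentiable in p,
    near-binary as the temperature t -> 0."""
    u = rng.uniform(size=shape)
    logits = np.log(p / (1. - p)) + np.log(u / (1. - u))
    return 1. / (1. + np.exp(-logits / t))  # sigmoid((logit p + logistic noise) / t)

rng = np.random.default_rng(0)
z = concrete_mask(p=0.3, t=0.1, shape=(100_000,), rng=rng)
# mean of z is close to p, and most entries lie close to 0 or 1
```

Because the mask is a deterministic, differentiable function of p given the uniform noise, gradients wrt p can flow through it — exactly what plain Bernoulli sampling prevents.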
Learning dropout probabilities
SVI (Hoffman et al., 2013) can be used to approximate the posterior:
L̂_MC(θ) = −(1/M) Σ_{i∈S} log p(y_i | f^{ω_θ}(x_i)) + (1/N) KL(q_θ(ω) ‖ p(ω))
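For concreteness, a minimal numpy sketch of this minibatch estimator (my own illustrative code, not from the poster; it assumes a Gaussian likelihood with unit variance and takes the KL value as given):

```python
import numpy as np

def neg_elbo_minibatch(y, f_out, kl_value, N):
    """Stochastic estimate of the objective: average negative log-likelihood
    over a minibatch S of size M, plus the KL term scaled by 1/N."""
    # unit-variance Gaussian log-likelihood per point (illustrative choice)
    log_lik = -0.5 * (y - f_out)**2 - 0.5 * np.log(2. * np.pi)
    return -np.mean(log_lik) + kl_value / N
```

Minimising this objective over both the weight matrices M_k and the probabilities p_k is what lets the dropout rates be learned by gradient descent alongside the weights.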
The structure of q_θ(ω) turns the KL into a sum of per-layer terms:

KL(q_{M_k, p_k}(W_k) ‖ p(W_k)) ∝ (l² (1 − p_k) / 2) ‖M_k‖²_F − K_{k+1} H(p_k)

H(p_k) := −p_k log p_k − (1 − p_k) log(1 − p_k)
[Figure: converged dropout probabilities (Layer #1–#3) and epistemic, aleatoric, and predictive uncertainty (std), each plotted against the number of data points N]
Properties:
• For large K_{k+1}, H(p_k) pushes p_k → 0.5, maximising entropy
• Large ‖M_k‖²_F forces p_k → 1, i.e. to drop all weights
• As N → ∞, the KL is ignored and the posterior concentrates at the MLE
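These properties can be checked numerically; the sketch below (illustrative, not the authors' code) grid-searches p over the per-layer KL term, with K the input dimension and l the prior length-scale:

```python
import numpy as np

def layer_kl(p, M_sq_norm, K, l=1.0):
    """Per-layer KL up to a constant: weight term minus K times the entropy H(p)."""
    H = -p * np.log(p) - (1. - p) * np.log(1. - p)
    return l**2 * (1. - p) / 2. * M_sq_norm - K * H

ps = np.linspace(0.01, 0.99, 99)
# small weight norm, large K: the entropy term dominates, minimiser sits near 0.5
p_entropy = ps[np.argmin(layer_kl(ps, M_sq_norm=1.0, K=100))]
# large weight norm: the weight term dominates, minimiser is pushed towards 1
p_weights = ps[np.argmin(layer_kl(ps, M_sq_norm=1000.0, K=10))]
```

The data term in the objective counteracts the pull towards p → 1, so the learned probabilities settle at a data-dependent balance between the two.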
Simple to implement in Keras (Chollet et al., 2015):

# regularisation
...
# weight term of the KL; (1 - p) and l^2 are folded into self.weight_regularizer
kernel_regularizer = self.weight_regularizer * K.sum(K.square(weight))
# negative entropy -H(p) of the dropout distribution, scaled by the input dimension
dropout_regularizer = self.p * K.log(self.p) + (1. - self.p) * K.log(1. - self.p)
dropout_regularizer *= self.dropout_regularizer * input_dim
regularizer = K.sum(kernel_regularizer + dropout_regularizer)
self.add_loss(regularizer)
...
# forward pass
...
# Concrete relaxation: sigmoid((logit(p) + logistic noise) / t) in place of Bernoulli(p)
u = K.random_uniform(shape=K.shape(x))
z = K.log(self.p / (1. - self.p)) + K.log(u / (1. - u))
z = K.sigmoid(z / temp)
x *= 1. - z
...
Application to image segmentation
Epistemic and aleatoric uncertainty in machine vision:
Image segmentation using Bayesian SegNet (Kendall et al., 2015)
Converged probabilities were robust to random initialisation:
[Figure: concrete dropout p (0–0.5) vs training iterations (0–40,000), repeated across random initialisations]
They also compare favourably to an expensively hand-tuned setting:
DenseNet Model Variant             MC Sampling   IoU
No Dropout                         –             65.8
Dropout (manually-tuned p = 0.2)   ✗             67.1
Dropout (manually-tuned p = 0.2)   ✓             67.2
Concrete Dropout                   ✗             67.2
Concrete Dropout                   ✓             67.4

Comparing the performance against baseline models with DenseNet on the CamVid road scene semantic segmentation dataset.
[Figure: calibration plot of frequency vs probability — No Dropout, MSE = 0.0401; Dropout, MSE = 0.0316; Concrete Dropout, MSE = 0.0296]
Reduced uncertainty calibration RMSE
Conclusion and future research
• Automatic tuning of dropout probabilities, even for very large models
• Better calibrated uncertainty estimates
• RL: epistemic uncertainty will vanish as more data is acquired