Bayesian Deep Learning
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, [email protected]
http://mlg.postech.ac.kr/~seungjin
December 2, 2016
1 / 44
Deep Learning

- Function composition:
  $f \approx \sigma_L \circ \sigma_{L-1} \circ \cdots \circ \sigma_1$
- Fully-connected network (MLP):
  $h_t^{(l)} = \sigma\big(W^{(l)} h_t^{(l-1)} + b_t^{(l)}\big)$
- Convolutional neural network:
  $h_t^{(l)} = \sigma\big(w^{(l)} \ast h_t^{(l-1)} + b_t^{(l)}\big)$
- Recurrent neural network:
  $h_t^{(l)} = \sigma\big(W^{(l)} h_t^{(l-1)} + V^{(l)} h_{t-1}^{(l)} + b_t^{(l)}\big)$
2 / 44
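The three layer recursions above can be sketched in NumPy; this is a toy illustration with assumed shapes, using tanh as the nonlinearity $\sigma$ and a 1-D convolution for the CNN case:

```python
import numpy as np

sigma = np.tanh  # elementwise nonlinearity

def mlp_layer(W, h_prev, b):
    """Fully-connected layer: h = sigma(W h + b)."""
    return sigma(W @ h_prev + b)

def conv_layer(w, h_prev, b):
    """1-D convolutional layer: h = sigma(w * h + b), '*' = convolution."""
    return sigma(np.convolve(h_prev, w, mode="same") + b)

def rnn_layer(W, h_below, V, h_past, b):
    """Recurrent layer: h_t^{(l)} = sigma(W h_t^{(l-1)} + V h_{t-1}^{(l)} + b)."""
    return sigma(W @ h_below + V @ h_past + b)
```

Stacking such layers (feeding each output into the next) realizes the function composition $f \approx \sigma_L \circ \cdots \circ \sigma_1$.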
ImageNet Challenge
3 / 44
2015: A Milestone Year in Computer Science

AlexNet (2012)

- AlexNet (5 convolutional layers + 3 fully connected layers), 2012
- VGG (very deep CNN, 16-19 weight layers), 2015
- GoogLeNet (22 layers), 2015
- Deep Residual Net (100-1000 layers), 2015

https://blogs.nvidia.com/blog/2016/01/12/accelerating-ai-artificial-intelligence-gpus/
4 / 44
Bayes + Deep = ?

- Deep
  - Good approximation of complex nonlinear transforms
  - Deep hierarchy for representation learning
- Bayes
  - Model comparison
  - Predictive distribution (averaging the likelihood w.r.t. the posterior over parameters)
  - Uncertainty
- Bayesian deep learning = combine the best of the two approaches?
5 / 44
CNN + Bayesian Model Comparison
6 / 44
Why is the CNN so successful?

- Similar to the simple and complex cells in the V1 area of the visual cortex
- Deep architecture
- Supervised representation learning
LeNet, 1989
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to
document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
7 / 44
Pre-Trained CNNs

- AlexNet (5 convolutional layers + 3 fully connected layers): A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 25, 2012.
- VGG (very deep CNN, 16-19 weight layers): K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- GoogLeNet (22 layers): C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- Deep Residual Net (100-1000 layers): K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
8 / 44
Bayesian Model Comparison

Model selection: choose the single most probable model, based on the evidence

$$p(\mathcal{D} \mid \mathcal{M}_i) = \int p(\mathcal{D} \mid w, \mathcal{M}_i)\, p(w \mid \mathcal{M}_i)\, dw.$$

[figure: evidence $p(\mathcal{D})$ plotted over data sets $\mathcal{D}$ for models $\mathcal{M}_1$, $\mathcal{M}_2$, $\mathcal{M}_3$, with a particular data set $\mathcal{D}_0$ marked]

- $\mathcal{M}_1$ is the simplest and $\mathcal{M}_3$ is the most complex.
- For the particular observed data set $\mathcal{D}_0$, the model $\mathcal{M}_2$ with intermediate complexity has the largest evidence.
9 / 44
Bayesian Linear Regression

- Gaussian likelihood: $p(y \mid X, w) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid x_i^\top w, \beta^{-1})$.
- Gaussian prior: $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$.
- Posterior is Gaussian of the form $p(w \mid y, X) = \mathcal{N}(w \mid \mu_N, \Lambda_N^{-1})$, where
  $$\mu_N = \beta \Lambda_N^{-1} X y, \qquad \Lambda_N = \alpha I + \beta X X^\top.$$
- Marginal likelihood (evidence) is given by
  $$\mathcal{L}(\alpha, \beta) = \log p(y \mid X, \alpha, \beta) = \log \int p(y \mid X, w, \beta)\, p(w \mid \alpha)\, dw = \frac{D}{2}\log\alpha + \frac{N}{2}\log\beta - \frac{\beta}{2}\|y - X^\top \mu_N\|^2 - \frac{\alpha}{2}\mu_N^\top \mu_N - \frac{1}{2}\log|\Lambda_N| - \frac{N}{2}\log 2\pi.$$
- Fixed-point updates for the hyperparameters $\alpha$ and $\beta$:
  $$\alpha = \frac{\gamma}{\mu_N^\top \mu_N}, \qquad \beta = \frac{N - \gamma}{\|y - X^\top \mu_N\|^2},$$
  where
  $$\gamma = \sum_{d=1}^{D} \frac{\beta s_d}{\alpha + \beta s_d} \quad (s_d \text{ are the eigenvalues of } X X^\top).$$
10 / 44
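A minimal NumPy sketch of these fixed-point updates, under the slide's convention that data points are the columns of a $D \times N$ matrix $X$:

```python
import numpy as np

def evidence_fixed_point(X, y, alpha=1.0, beta=1.0, n_iter=100):
    """Evidence (type-II maximum likelihood) updates for alpha, beta
    in Bayesian linear regression; X is (D, N), data points as columns."""
    D, N = X.shape
    s = np.linalg.eigvalsh(X @ X.T)              # eigenvalues s_d of X X^T
    for _ in range(n_iter):
        Lambda_N = alpha * np.eye(D) + beta * (X @ X.T)
        mu_N = beta * np.linalg.solve(Lambda_N, X @ y)
        gamma = np.sum(beta * s / (alpha + beta * s))
        alpha = gamma / (mu_N @ mu_N)            # alpha = gamma / (mu^T mu)
        beta = (N - gamma) / np.sum((y - X.T @ mu_N) ** 2)
    return alpha, beta, mu_N
```

At convergence, $\beta$ estimates the noise precision and $\mu_N$ the posterior mean of the weights, without any cross-validation over $(\alpha, \beta)$.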
Bayesian Approach: Evidence $p(\mathcal{D} \mid \mathcal{M}_i)$

- Select a model with maximum evidence
- Select a subset of pre-trained CNNs in a greedy manner
11 / 44
Fast Bayesian Learning
Yong-Deok Kim, Taewoong Jang, Bohyung Han, and Seungjin Choi (2016), "Learning to select pre-trained deep representations with Bayesian evidence framework," in CVPR-2016. (oral)
12 / 44
13 / 44
Linear Generative Models
14 / 44
Linear Generative Models
x = Az + ε
- Factor analysis: spherical Gaussian prior
- Independent component analysis: independent non-Gaussian prior
- Nonnegative matrix factorization: nonnegative prior
15 / 44
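As a toy sketch, sampling from the generative model $x = Az + \epsilon$; only the prior on $z$ changes across the three models (shapes and noise level are assumptions, factor-analysis case shown):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 1000                    # observed dim, latent dim, samples
A = rng.normal(size=(D, K))             # mixing matrix

# Factor analysis: spherical Gaussian prior on z.
z = rng.normal(size=(K, N))
# ICA would instead use an independent non-Gaussian prior, e.g.
# rng.laplace(size=(K, N)); NMF a nonnegative prior, e.g.
# rng.exponential(size=(K, N)) together with a nonnegative A.
x = A @ z + 0.1 * rng.normal(size=(D, N))   # x = Az + noise
```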
ICA
Assume that {zi} are statistically independent.
16 / 44
17 / 44
NMF
Lee and Seung (1999), "Learning the parts of objects by non-negative matrix factorization," Nature
18 / 44
Multiplicative Up-Prop
Ahn, Oh, Choi, "A multiplicative up-propagation algorithm," ICML-2004
19 / 44
20 / 44
Deep Generative Models

Generative Models + Deep Network
21 / 44
Back-Prop (discriminative): learning with labeled data

Up-Prop (generative): learning with unlabeled data

Oh and Seung, "Learning Generative Models with the Up Propagation Algorithm," NIPS-1997
22 / 44
Deep Directed Generative Models

- Probabilistic decoder: $p_\theta(x \mid z)$
- Inference: $p(z \mid x)$
- Density network: $p_\theta(x \mid z) = g(x \mid z, \theta)$ [MacKay and Gibbs]
- Variational autoencoder: introduce an inference network $q_\phi(z \mid x) = f(z \mid x, \phi)$ [Kingma and Welling]

https://www.openai.com/blog/generative-models/
23 / 44
Variational Lower-Bound

$$\log p(x) \ge \int q(z \mid x) \log \frac{p(x, z)}{q(z \mid x)}\, dz = \int q(z \mid x) \log \frac{p(x \mid z)\, p(z)}{q(z \mid x)}\, dz = \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{Reconstruction}} - \underbrace{\mathrm{KL}\big[q(z \mid x) \,\|\, p(z)\big]}_{\text{Penalty}}.$$

- Reconstruction cost: the expected log-likelihood measures how well samples from $q(z \mid x)$ are able to explain the data $x$.
- Penalty: the approximation $q(z \mid x)$ to the posterior does not deviate too far from your prior beliefs $p(z)$.
24 / 44
Stochastic Gradient Variational Bayes

$$\mathcal{F}(x, \phi) = \underbrace{\mathbb{E}_q\big[\log p(x \mid z)\big]}_{\text{SGVB}} - \underbrace{\mathrm{KL}\big[q(z \mid x) \,\|\, p(z)\big]}_{\text{analytically computed}},$$

where $\mathbb{E}_q[\cdot]$ denotes the expectation w.r.t. $q(z \mid x)$ and Monte Carlo estimates are performed with the reparameterization trick:

$$\mathbb{E}_q[\log p(x \mid z)] \approx \frac{1}{L} \sum_{l=1}^{L} \log p(x \mid z^{(l)}),$$

where $z^{(l)} = m + \sqrt{\lambda} \odot \epsilon^{(l)}$ and $\epsilon^{(l)} \sim \mathcal{N}(0, I)$. A single sample is often sufficient to form this Monte Carlo estimate in practice.
25 / 44
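A toy NumPy sketch of the reparameterized Monte Carlo estimate above; the simple Gaussian decoder $p(x \mid z) = \mathcal{N}(x \mid z, I)$ is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_x_given_z(x, z):
    """Toy Gaussian decoder: log N(x | z, I)."""
    return -0.5 * np.sum((x - z) ** 2) - 0.5 * x.size * np.log(2 * np.pi)

def reconstruction_estimate(x, m, lam, L=1):
    """Monte Carlo estimate of E_q[log p(x|z)] with q(z|x) = N(m, diag(lam)),
    via the reparameterization z = m + sqrt(lam) * eps, eps ~ N(0, I);
    eps carries all the randomness, so gradients can flow through (m, lam)."""
    total = 0.0
    for _ in range(L):
        eps = rng.normal(size=np.shape(m))
        z = m + np.sqrt(lam) * eps
        total += log_p_x_given_z(x, z)
    return total / L
```

In a real VAE, `m` and `lam` are the outputs of the inference network $q_\phi(z \mid x)$ and the estimate is differentiated with automatic differentiation.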
Johnson et al. (2016), "Structured VAEs: Composing Probabilistic Graphical Models and Variational Autoencoders," Preprint arXiv:1603.06277
26 / 44
Data Imputation
Taken from Shakir Mohamed’s slides
27 / 44
Image Generation
Taken from Shakir Mohamed’s slides
28 / 44
Semi-Supervised Learning
A small number of labeled examples together with plenty of unlabeled examples
Taken from Wikipedia
29 / 44
VAE with Rank-One Covariance [Suh and Choi, 2016]

[figure: latent space $Z$ mapped to data space $(X_1, X_2)$, with local principal directions $a(z^{(k)})$ and means $\mu(z^{(k)})$ shown at latent points $z^{(1)}, \ldots, z^{(6)}$]

- Find the local principal direction at a specific location $\mu(z)$:
  $$p(x \mid z) = \mathcal{N}\big(\mu, \omega I + a a^\top\big), \qquad p(z) = \mathcal{N}(0, I),$$
  $$\mu = W_\mu h + b_\mu, \quad \log\omega = w_\omega^\top h + b_\omega, \quad a = W_a h + b_a, \quad h = \tanh(W_h z + b_h).$$
- Can be interpreted as an infinite mixture of PPCA ($p(s) = \mathcal{N}(0, 1)$):
  $$p(x \mid s, z) = \mathcal{N}(a s + \mu, \omega I)$$

Suwon Suh and Seungjin Choi (2016), "Gaussian copula variational autoencoders for mixed data," Preprint arXiv:1604.04960
30 / 44
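The PPCA interpretation gives a direct way to sample from the rank-one-covariance decoder; a sketch with assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_rank_one(mu, a, omega, n):
    """Draw x ~ N(mu, omega*I + a a^T) via the PPCA view:
    x = a*s + mu + sqrt(omega)*eps, with s ~ N(0, 1), eps ~ N(0, I)."""
    D = mu.shape[0]
    s = rng.normal(size=(n, 1))          # one scalar latent per sample
    eps = rng.normal(size=(n, D))
    return s * a + mu + np.sqrt(omega) * eps
```

Marginalizing out the scalar $s$ recovers exactly the rank-one covariance $\omega I + a a^\top$, which the test below checks empirically.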
(a) True images (b) Generated images
31 / 44
[figure: (a) data, (b) VAE]
32 / 44
[figure: (a) VAE-ROC without regularization, (b) VAE-ROC with L2 norm regularization, $\lambda_{\text{local}} = 5$]
33 / 44
Copulas

- A $D$-dimensional copula $C$ is a distribution function on the unit cube $[0,1]^D$ with each univariate marginal distribution being uniform on $[0,1]$.
- Classical result of Sklar (1959) (when the $x_i$ are continuous):
  $$F(x_1, \ldots, x_D) = C\big(F_1(x_1), \ldots, F_D(x_D)\big),$$
  $$p(x_1, \ldots, x_D) = c\big(F_1(x_1), \ldots, F_D(x_D)\big) \prod_{i=1}^{D} p_i(x_i).$$
- Define $u_i = F_i(x_i) \in [0,1]$, $i = 1, \ldots, D$; then we have
  $$C(u_1, \ldots, u_D) = F\big(F_1^{-1}(u_1), \ldots, F_D^{-1}(u_D)\big),$$
  $$c(u_1, \ldots, u_D) = \frac{\partial^D C(u_1, \ldots, u_D)}{\partial u_1 \cdots \partial u_D} \quad \text{(copula density)}.$$
34 / 44
Gaussian Copula

- The Gaussian copula with covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$ is given by
  $$C_\Phi(u_1, \ldots, u_D) = \Phi_\Sigma\big(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_D) \mid \Sigma\big),$$
  where $\Phi_\Sigma(\cdot \mid \Sigma)$ is the $D$-dimensional Gaussian CDF with covariance matrix $\Sigma$ whose diagonal entries are equal to one, and $\Phi(\cdot)$ is the univariate standard Gaussian CDF.
- The Gaussian copula density is given by
  $$c_\Phi(u_1, \ldots, u_D) = \frac{\partial^D C_\Phi(u_1, \ldots, u_D)}{\partial u_1 \cdots \partial u_D} = |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2} q^\top (\Sigma^{-1} - I) q\right\},$$
  where $q = [q_1, \ldots, q_D]^\top$ with normal scores $q_i = \Phi^{-1}(u_i)$ for $i = 1, \ldots, D$.
35 / 44
Invoking the result of Sklar with this Gaussian copula density, the joint density function is written as

$$p(x) = |\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2} q^\top (\Sigma^{-1} - I) q\right\} \prod_{i=1}^{D} p_i(x_i).$$
36 / 44
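The Gaussian copula density above can be evaluated directly; a sketch using SciPy's `norm.ppf` for the normal scores:

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_density(u, Sigma):
    """c_Phi(u) = |Sigma|^{-1/2} exp(-0.5 q^T (Sigma^{-1} - I) q),
    with normal scores q_i = Phi^{-1}(u_i)."""
    q = norm.ppf(np.asarray(u))
    D = len(u)
    quad = q @ (np.linalg.inv(Sigma) - np.eye(D)) @ q
    return np.linalg.det(Sigma) ** -0.5 * np.exp(-0.5 * quad)
```

Note that $\Sigma = I$ gives the independence copula, whose density is 1 everywhere on the unit cube.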
Continuous Extension: Discrete Variables

- When the $x_i$ are discrete, the copula $C_\Phi(u_1, \ldots, u_D)$ is uniquely determined only on the range of $F_1 \times \cdots \times F_D$.
- The joint probability mass function (PMF) of $x_1, \ldots, x_D$ is given by
  $$p(x_1, \ldots, x_D) = \sum_{j_1=1}^{2} \cdots \sum_{j_D=1}^{2} (-1)^{j_1 + \cdots + j_D}\, \Phi_\Sigma\big(\Phi^{-1}(u_{1,j_1}), \ldots, \Phi^{-1}(u_{D,j_D})\big),$$
  where $u_{i,1} = F_i(x_i^-)$, the limit of $F_i(\cdot)$ at $x_i$ from the left, and $u_{i,2} = F_i(x_i)$.
- The PMF requires the evaluation of $2^D$ terms, which is not manageable even for a moderate value of $D$ (for instance, $D \ge 5$).
- A continuous extension (CE) of the discrete random variables $x_i$ avoids the $D$-fold summation in the above equation, associating a continuous random variable $x_i^* = x_i - v_i$ with the integer-valued $x_i$, where $v_i$ is uniform on $[0,1]$ and is independent of $x_i$ as well as of $v_j$ for $j \ne i$.
37 / 44
- The continuous random variables $x_i^*$ produced by jittering $x_i$ yield the CDF and PDF given by
  $$F_i^*(\xi) = F_i([\xi]) + (\xi - [\xi])\, P(x_i = [\xi + 1]),$$
  $$p_i^*(\xi) = P(x_i = [\xi + 1]),$$
  where $[\xi]$ represents the greatest integer less than or equal to $\xi$.
- The joint PMF for $x_1, \ldots, x_D$ is given by
  $$p(x_1, \ldots, x_D) = \mathbb{E}_v\left[|\Sigma|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2} q^{*\top} (\Sigma^{-1} - I) q^*\right\} \prod_{i=1}^{D} p_i^*(x_i - v_i)\right],$$
  $$q^* = \big[\Phi^{-1}(F_1^*(x_1 - v_1)), \ldots, \Phi^{-1}(F_D^*(x_D - v_D))\big]^\top.$$
38 / 44
Gaussian Copula VAE
[figure: plate diagram of the Gaussian copula VAE over $N$ instances, with latent $z^{(n)}$, hidden layer $h^{(n)}$, copula parameters, and mixed continuous/discrete observations]
39 / 44
Attributes are mixed categorical (ordinal) and real-valued.
Table: Approximated test log-likelihood on the UCI Auto and UCI SPECT datasets.

          UCI Auto (10K)        UCI SPECT (10K)
VAE       -200.289 ± 3.751      -144.195 ± 3.443
GCVAE     -189.344 ± 4.599      -134.579 ± 2.621
40 / 44
Anomaly Detection
[figure: (a) toy data, (b) anomaly scores, (c) latent space, (d) our model (graphical model with $r^{(i-1)}$, $x^{(i)}$, $z^{(i)}$ in a plate over $N$)]
41 / 44
Echo State Networks [Herbert Jaeger, 2002]

- An approach to recurrent neural network training.
- Consists of a large, fixed, recurrent "reservoir" network:
  $$r^{(i)} = \alpha r^{(i-1)} + (1 - \alpha) f\big(A r^{(i-1)} + B [1; \Lambda_x x^{(i)}]\big),$$
  where $A$ and $B$ are NOT trained but only properly initialized.
- The network output $y^{(i)}$ is computed by training suitable output connection weights $C$:
  $$y^{(i)} = C \big[1; x^{(i)}; r^{(i)}\big].$$
42 / 44
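A minimal reservoir sketch of the update above ($\Lambda_x$ folded into $B$; rescaling $A$ to a chosen spectral radius is one common way to "properly initialize" it, assumed here):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_res, n_in, spectral_radius=0.9):
    """Random, untrained reservoir weights A (rescaled) and input weights B."""
    A = rng.normal(size=(n_res, n_res))
    A *= spectral_radius / np.max(np.abs(np.linalg.eigvals(A)))
    B = rng.normal(size=(n_res, 1 + n_in))   # acts on [1; x]
    return A, B

def run_reservoir(A, B, xs, alpha=0.5, f=np.tanh):
    """Leaky update: r_i = alpha*r_{i-1} + (1-alpha)*f(A r_{i-1} + B [1; x_i])."""
    r = np.zeros(A.shape[0])
    states = []
    for x in xs:
        u = np.concatenate(([1.0], np.atleast_1d(x)))
        r = alpha * r + (1 - alpha) * f(A @ r + B @ u)
        states.append(r.copy())
    return np.array(states)
```

Only the readout $C$ is then trained, e.g. by (ridge) regression of the targets $y^{(i)}$ on the features $[1; x^{(i)}; r^{(i)}]$.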
Our Model: ES-CVAE [Suh and Choi, 2016]

[figure: graphical model with $r^{(i-1)}$, $x^{(i)}$, $z^{(i)}$ in a plate over $N$]

The joint distribution over a sequence of $N$ instances,
$$p(x_1, x_2, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{1:n-1}),$$
is modeled as
$$p(x_1, x_2, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid r_{n-1}),$$
where
$$p(x_n \mid r_{n-1}) = \int p(x_n \mid z_n, r_{n-1})\, p(z_n \mid r_{n-1})\, dz_n,$$
and the reservoir states $r_{n-1}$ are computed by
$$r_{n-1} = \alpha r_{n-2} + (1 - \alpha) f\big(A r_{n-2} + B [1; \Lambda_x x_{n-1}]\big).$$

Anomaly score of $x_n$: $-\log p(x_n \mid r_{n-1})$
43 / 44
Summary
- A quick overview of deep learning
- Pre-trained CNN + Bayesian model comparison
- Deep directed generative models
- Gaussian copula variational autoencoders
- Echo-state conditional variational autoencoders
44 / 44