A Unified Approach for Learning the Parameters of Sum-Product Networks

Han Zhao†, Pascal Poupart⋆ and Geoff Gordon†
†Machine Learning Department, Carnegie Mellon University
⋆David R. Cheriton School of Computer Science, University of Waterloo
Introduction
• We present a unified approach for learning the parameters of Sum-Product Networks (SPNs).
• We construct a more efficient factorization of complete and decomposable SPNs into a mixture of trees, with each tree being a product of univariate distributions.
• We show that the MLE problem for SPNs can be formulated as a signomial program.
• We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP). Both SMA and CCCP admit multiplicative weight updates.
• We prove the convergence of CCCP on SPNs.
Background
Sum-Product Networks (SPNs):
• Rooted directed acyclic graph of univariate distributions, sum nodes and product nodes.
• We focus on discrete SPNs, but the proposed algorithms work for continuous ones as well.
Recursive computation of the network:

$$V_k(x \mid w) =
\begin{cases}
p(X_i = x_i) & k \text{ is a leaf node over } X_i \\
\prod_{j \in Ch(k)} V_j(x \mid w) & k \text{ is a product node} \\
\sum_{j \in Ch(k)} w_{kj} V_j(x \mid w) & k \text{ is a sum node}
\end{cases}$$
Scope: the set of variables that have univariate distributions among the node's descendants.
Complete: an SPN is complete iff each sum node has children with the same scope.
Decomposable: an SPN is decomposable iff for every product node $v$, $\mathrm{scope}(v_i) \cap \mathrm{scope}(v_j) = \emptyset$ for all $v_i, v_j \in Ch(v)$, $i \neq j$.
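The recursion above maps directly onto a depth-first traversal of the network. Below is a minimal sketch of the bottom-up evaluation V_k(x | w), assuming an illustrative Node representation (the class and field names are ours, not the paper's):

```python
# Minimal sketch of the recursive evaluation V_k(x | w).
# The Node class and its fields are illustrative, not the paper's code.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    kind: str                                   # "leaf", "sum", or "product"
    children: List["Node"] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)    # sum nodes only
    var: int = -1                               # leaf nodes: index i of X_i
    dist: Dict[int, float] = field(default_factory=dict)  # leaf: p(X_i = x_i)

def value(k: Node, x: Dict[int, int]) -> float:
    """Bottom-up computation of V_k(x | w), following the three cases above."""
    if k.kind == "leaf":
        return k.dist[x[k.var]]                 # univariate distribution
    if k.kind == "product":
        out = 1.0
        for c in k.children:                    # product over children
            out *= value(c, x)
        return out
    return sum(w * value(c, x)                  # weighted sum over children
               for w, c in zip(k.weights, k.children))

# Example: a sum node over two Bernoulli leaves with the same scope {X_0}.
leaf_a = Node("leaf", var=0, dist={0: 0.8, 1: 0.2})
leaf_b = Node("leaf", var=0, dist={0: 0.3, 1: 0.7})
root = Node("sum", children=[leaf_a, leaf_b], weights=[0.6, 0.4])
print(value(root, {0: 1}))                      # 0.6*0.2 + 0.4*0.7 = 0.40
```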
A Unified Framework for Learning
(SPNs as a Mixture of Trees) Theorem 1:
Every complete and decomposable SPN S can be factorized into a sum of Ω(2^h) induced trees (sub-graphs), where each tree corresponds to a product of univariate distributions and h is the height of S.
[Diagram: a complete and decomposable SPN over x1 and x2, whose root sum node has weights w1, w2, w3 over three product nodes of indicator leaves Ix1 and Ix2, decomposed into the weighted sum of three induced trees, each a product of univariate indicators.]
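A direct consequence of Theorem 1 is that the number of induced trees obeys the same recursion as network evaluation with every leaf and weight set to 1: a leaf contributes one tree, a product node multiplies its children's counts, and a sum node adds them. A minimal sketch, reusing the Node class from the evaluation sketch above:

```python
def count_trees(k: Node) -> int:
    """Number of induced trees rooted at k (Theorem 1): equivalent to
    evaluating the network with every leaf and weight replaced by 1."""
    if k.kind == "leaf":
        return 1
    if k.kind == "product":
        out = 1
        for c in k.children:
            out *= count_trees(c)   # a product combines choices independently
        return out
    return sum(count_trees(c) for c in k.children)  # a sum picks one child
```

On the diagrammed SPN this returns 3; alternating sum and product layers is what drives the count up to Ω(2^h).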
Maximum Likelihood Estimation as Signomial Program:
The MLE optimization is:

$$\underset{w}{\text{maximize}} \quad \frac{f_S(x \mid w)}{f_S(\mathbf{1} \mid w)}
= \frac{\sum_{t=1}^{\tau} \prod_{n=1}^{N} I^{(t)}_{x_n} \prod_{d=1}^{D} w_d^{\,I_{w_d \in T_t}}}
       {\sum_{t=1}^{\tau} \prod_{d=1}^{D} w_d^{\,I_{w_d \in T_t}}}
\qquad \text{subject to} \quad w \in \mathbb{R}^{D}_{++}$$
Here τ = Ω(2^h), D is the number of parameters in S, N is the number of random variables modeled by S, and T_t is the t-th induced tree.

Proposition 2:
The MLE problem for SPNs is a signomial program.

Logarithmic transformation leads to a difference of convex functions:
$$\underset{y}{\text{maximize}} \quad
\log \sum_{t=1}^{\tau(x)} \exp\!\Big(\sum_{d=1}^{D} y_d\, I_{y_d \in T_t}\Big)
\;-\;
\log \sum_{t=1}^{\tau} \exp\!\Big(\sum_{d=1}^{D} y_d\, I_{y_d \in T_t}\Big)$$

where y = log w and the first sum ranges over the τ(x) induced trees consistent with the evidence x.
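Both terms are log-sum-exp functions of y, hence convex. As a numerically stable illustration (exponential in the number of trees, so usable only on toy networks; the actual SMA/CCCP updates never enumerate trees), here is a sketch assuming each induced tree is encoded as a row of a binary membership matrix, a representation we introduce purely for illustration:

```python
import numpy as np
from scipy.special import logsumexp

def dcp_objective(y, T, consistent):
    """Difference-of-convex objective log f_S(x|w) - log f_S(1|w)
    in log-space, with y = log w.

    y          : (D,) vector of log-weights
    T          : (tau, D) binary matrix, T[t, d] = 1 iff w_d appears in T_t
    consistent : (tau,) boolean mask of trees whose leaves agree with x
    """
    scores = T @ y                    # per-tree sum_d y_d * I[y_d in T_t]
    return logsumexp(scores[consistent]) - logsumexp(scores)

# Toy usage: the diagrammed SPN has 3 trees, each using one root weight.
T = np.eye(3)
y = np.log(np.array([0.2, 0.3, 0.5]))
mask = np.array([True, True, False])  # trees 1 and 2 consistent with x
print(dcp_objective(y, T, mask))      # log(0.2 + 0.3) - log(1.0) ≈ -0.693
```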
Sequential Monomial Approximation (SMA): optimal linear approximation in log-space, corresponding to the optimal monomial function approximation to the original signomial.
Concave-Convex Procedure (CCCP): sequential convex relaxation by linearizing the first term, with an efficient O(|S|) closed-form solver for each convex sub-problem.
(Convergence of CCCP) Theorem 2:
Let {w^(k)}_{k=1}^∞ be any sequence generated by CCCP from any feasible initial point. Then all limiting points of {w^(k)}_{k=1}^∞ are stationary points of the difference-of-convex-functions program (DCP). In addition, lim_{k→∞} f(y^(k)) = f(y*), where y* is some stationary point of the DCP, i.e., the sequence of objective function values converges.
Algo | Update Type    | Update Formula
PGD  | Additive       | w_d^(k+1) ← P_{R^ε_++}[w_d^(k) + γ ∇_{w_d} f(w^(k))]
EG   | Multiplicative | w_d^(k+1) ← w_d^(k) · exp{γ ∇_{w_d} f(w^(k))}
SMA  | Multiplicative | w_d^(k+1) ← w_d^(k) · exp{γ w_d^(k) ∇_{w_d} f(w^(k))}
CCCP | Multiplicative | w_ij^(k+1) ∝ w_ij^(k) · ∇_{v_i} f_S(w^(k)) · f_{v_j}(w^(k))
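A minimal sketch of the four updates in the table, assuming the gradient (and, for CCCP, the sum node's derivative and its children's values) are computed elsewhere; reading the PGD projection P_{R^ε_++} as clipping at ε is our interpretation:

```python
import numpy as np

def pgd_update(w, grad, gamma, eps=1e-8):
    """PGD: additive step, then projection onto the shifted positive
    orthant R^eps_++ (here read as clipping at eps)."""
    return np.maximum(w + gamma * grad, eps)

def eg_update(w, grad, gamma):
    """EG: multiplicative (exponentiated-gradient) update."""
    return w * np.exp(gamma * grad)

def sma_update(w, grad, gamma):
    """SMA: multiplicative update with the gradient scaled by w itself."""
    return w * np.exp(gamma * w * grad)

def cccp_update(w_ij, dv_i, f_vj):
    """CCCP: for one sum node v_i, reweight each child edge by the node's
    derivative times the child's value, then renormalize the weights."""
    unnorm = w_ij * dv_i * f_vj
    return unnorm / unnorm.sum()
```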
Experiments
[Figure 1: twenty panels, one per dataset (NLTCS, MSNBC, KDD 2000, Plants, Audio, Jester, Netflix, Accidents, Retail, Pumsb Star, DNA, Kosarek, MSWeb, Book, EachMovie, WebKB, Reuters-52, 20 Newsgroup, BBC, Ad), each plotting −log Pr(x|w) against the number of iterations for PGD, EG, SMA and CCCP.]

Figure 1: Negative log-likelihood values versus number of iterations for PGD, EG, SMA and CCCP on 20 benchmark datasets.
• CCCP consistently outperforms the other three algorithms.
• We suggest CCCP for maximum likelihood estimation, and CVB for Bayesian learning of SPNs (see our ICML 2016 paper).