
A Unified Approach for Learning the Parameters of Sum-Product Networks

Han Zhao†, Pascal Poupart⋆ and Geoff Gordon†
{han.zhao, ggordon}@cs.cmu.edu, [email protected]

†Machine Learning Department, Carnegie Mellon University
⋆David R. Cheriton School of Computer Science, University of Waterloo

Introduction

• We present a unified approach for learning the parameters of Sum-Product Networks (SPNs).

• We construct a more efficient factorization of complete and decomposable SPNs into a mixture of trees, with each tree being a product of univariate distributions.

• We show that the MLE problem for SPNs can be formulated as a signomial program.

• We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP). Both SMA and CCCP admit multiplicative weight updates.

• We prove the convergence of CCCP on SPNs.

Background

Sum-Product Networks (SPNs):

• Rooted directed acyclic graph of univariate distributions, sum nodes and product nodes.

• We focus on discrete SPNs, but the proposed algorithms work for continuous ones as well.

Recursive computation of the network:

$$
V_k(\mathbf{x} \mid \mathbf{w}) =
\begin{cases}
p(X_i = x_i) & k \text{ is a leaf node over } X_i \\
\prod_{j \in \mathrm{Ch}(k)} V_j(\mathbf{x} \mid \mathbf{w}) & k \text{ is a product node} \\
\sum_{j \in \mathrm{Ch}(k)} w_{kj}\, V_j(\mathbf{x} \mid \mathbf{w}) & k \text{ is a sum node}
\end{cases}
$$
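To make the recursion concrete, here is a minimal sketch of bottom-up SPN evaluation in Python. The Node class and the toy network below are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of bottom-up SPN evaluation following the recursion above.
# The Node representation is an illustrative assumption.

class Node:
    def __init__(self, kind, children=None, weights=None, var=None, dist=None):
        self.kind = kind              # 'leaf', 'product', or 'sum'
        self.children = children or []
        self.weights = weights or []  # w_kj for sum nodes, aligned with children
        self.var = var                # index of X_i for leaf nodes
        self.dist = dist              # maps values of X_i to probabilities

def value(node, x):
    """Evaluate V_k(x | w) recursively."""
    if node.kind == 'leaf':
        return node.dist[x[node.var]]                       # p(X_i = x_i)
    if node.kind == 'product':
        v = 1.0
        for child in node.children:                         # product over children
            v *= value(child, x)
        return v
    return sum(w * value(child, x)                          # weighted sum
               for w, child in zip(node.weights, node.children))

# Toy SPN matching the mixture-of-trees illustration in the next section.
ix1  = Node('leaf', var=0, dist={1: 1.0, 0: 0.0})   # indicator I_{x1}
ix1n = Node('leaf', var=0, dist={1: 0.0, 0: 1.0})   # indicator I_{x̄1}
ix2  = Node('leaf', var=1, dist={1: 1.0, 0: 0.0})
ix2n = Node('leaf', var=1, dist={1: 0.0, 0: 1.0})
p1 = Node('product', [ix1, ix2])
p2 = Node('product', [ix1, ix2n])
p3 = Node('product', [ix1n, ix2n])
root = Node('sum', [p1, p2, p3], weights=[0.2, 0.5, 0.3])

print(value(root, {0: 1, 1: 0}))   # 0.5, i.e. w_2 for the tree I_{x1} I_{x̄2}
```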

Scope: The set of variables that have univariate distributions among the node's descendants.

Complete: An SPN is complete iff each sum node has children with the same scope.

Decomposable: An SPN is decomposable iff for every product node $v$, $\mathrm{scope}(v_i) \cap \mathrm{scope}(v_j) = \emptyset$ for all $v_i, v_j \in \mathrm{Ch}(v)$, $i \neq j$.
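A small sketch of how scope, completeness, and decomposability can be checked by recursive passes over the graph; it assumes the illustrative Node class and toy SPN from the evaluation sketch above.

```python
# Sketch of scope computation plus completeness/decomposability checks,
# assuming the illustrative Node class from the evaluation sketch above.

def scope(node):
    """Variables with univariate distributions among the node's descendants."""
    if node.kind == 'leaf':
        return {node.var}
    s = set()
    for child in node.children:
        s |= scope(child)
    return s

def is_complete(node):
    """Every sum node's children must share the same scope."""
    if node.kind == 'sum':
        scopes = [scope(c) for c in node.children]
        if any(s != scopes[0] for s in scopes):
            return False
    return all(is_complete(c) for c in node.children)

def is_decomposable(node):
    """Every product node's children must have pairwise disjoint scopes."""
    if node.kind == 'product':
        seen = set()
        for c in node.children:
            s = scope(c)
            if seen & s:
                return False
            seen |= s
    return all(is_decomposable(c) for c in node.children)

print(is_complete(root), is_decomposable(root))   # True True for the toy SPN
```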

A Unified Framework for Learning

(SPNs as a Mixture of Trees) Theorem 1: Every complete and decomposable SPN $S$ can be factorized into a sum of $\Omega(2^h)$ induced trees (sub-graphs), where each tree corresponds to a product of univariate distributions; $h$ is the height of $S$.

Example (a root sum node with weights $w_1, w_2, w_3$ over three product nodes of indicator leaves $I_{x_1}, I_{\bar{x}_1}, I_{x_2}, I_{\bar{x}_2}$):

$$
S(\mathbf{x}) \;=\; w_1\, I_{x_1} I_{x_2} \;+\; w_2\, I_{x_1} I_{\bar{x}_2} \;+\; w_3\, I_{\bar{x}_1} I_{\bar{x}_2},
$$

i.e., the SPN decomposes into a mixture of three induced trees, each a product of univariate distributions.
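A small sketch that enumerates the induced trees of the toy SPN above (pick one child at each sum node, keep all children at each product node) and checks numerically that their weighted sum reproduces the network value; it assumes the Node class and the `root`/`value` objects from the earlier evaluation sketch.

```python
# Sketch: enumerate induced trees of the toy SPN and verify Theorem 1 numerically.
from itertools import product as cartesian

def induced_trees(node):
    """Yield (monomial weight, leaf list) pairs, one per induced tree at `node`."""
    if node.kind == 'leaf':
        yield 1.0, [node]
    elif node.kind == 'sum':
        # Choose exactly one child per sum node; multiply in its edge weight.
        for w, child in zip(node.weights, node.children):
            for cw, leaves in induced_trees(child):
                yield w * cw, leaves
    else:  # product node: take all children and combine their sub-trees
        child_trees = [list(induced_trees(c)) for c in node.children]
        for combo in cartesian(*child_trees):
            w, leaves = 1.0, []
            for cw, ls in combo:
                w *= cw
                leaves = leaves + ls
            yield w, leaves

def tree_value(weight, leaves, x):
    v = weight
    for leaf in leaves:
        v *= leaf.dist[x[leaf.var]]
    return v

x = {0: 1, 1: 0}
trees = list(induced_trees(root))
print(len(trees))                                      # 3 induced trees
print(sum(tree_value(w, ls, x) for w, ls in trees))    # 0.5 == value(root, x)
```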

Maximum Likelihood Estimation as a Signomial Program: The MLE optimization is

$$
\begin{aligned}
\underset{\mathbf{w}}{\text{maximize}} \quad &
\frac{f_S(\mathbf{x}\mid\mathbf{w})}{f_S(\mathbf{1}\mid\mathbf{w})}
= \frac{\sum_{t=1}^{\tau} \prod_{n=1}^{N} I^{(t)}_{x_n} \prod_{d=1}^{D} w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}}
       {\sum_{t=1}^{\tau} \prod_{d=1}^{D} w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}} \\
\text{subject to} \quad & \mathbf{w} \in \mathbb{R}^{D}_{++},
\end{aligned}
$$

where $\tau = \Omega(2^h)$, $D$ is the number of parameters in $S$, $N$ is the number of random variables modeled by $S$, and $\mathcal{T}_t$ is the $t$-th induced tree.

Proposition 2: The MLE problem for SPNs is a signomial program.
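As a concrete instance (our own worked example, not from the poster): for the toy two-variable SPN above and a single observation $x_1 = 1$, $x_2 = 0$, only the second induced tree is consistent with the evidence, so the objective reduces to

$$
\frac{f_S(\mathbf{x}\mid\mathbf{w})}{f_S(\mathbf{1}\mid\mathbf{w})} = \frac{w_2}{w_1 + w_2 + w_3},
$$

a ratio of posynomials whose maximum pushes all the mass onto $w_2$; with many observations the numerator becomes a product of such sums, giving the general signomial program of Proposition 2.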

Logarithmic transformation ($y_d = \log w_d$) leads to a difference of convex functions:

$$
\underset{\mathbf{y}}{\text{maximize}} \;\;
\log \sum_{t=1}^{\tau(\mathbf{x})} \exp\!\Big(\sum_{d=1}^{D} y_d\, \mathbb{I}[y_d \in \mathcal{T}_t]\Big)
\;-\;
\log \sum_{t=1}^{\tau} \exp\!\Big(\sum_{d=1}^{D} y_d\, \mathbb{I}[y_d \in \mathcal{T}_t]\Big),
$$

where $\tau(\mathbf{x})$ is the number of induced trees consistent with the evidence $\mathbf{x}$.

Sequential Monomial Approximation (SMA): Optimal linear approximation in log-space, which corresponds to the optimal monomial function approximation to the original signomial.

Concave-Convex Procedure (CCCP): Sequential convex relaxation by linearizing the first term, with an efficient $O(|S|)$ closed-form solver for each convex sub-problem.
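To make the relaxation explicit (a standard CCCP step, sketched here under the assumption that it matches the poster's setup): write the objective as $f(\mathbf{y}) = f_1(\mathbf{y}) - f_2(\mathbf{y})$, where both terms are convex log-sum-exp functions. Linearizing the convex first term at the current iterate gives a concave lower bound on $f$ that is tight at $\mathbf{y}^{(k)}$, and maximizing it yields the next iterate:

$$
\mathbf{y}^{(k+1)} \in \arg\max_{\mathbf{y}} \;
f_1(\mathbf{y}^{(k)}) + \nabla f_1(\mathbf{y}^{(k)})^{\top}\!\big(\mathbf{y} - \mathbf{y}^{(k)}\big) - f_2(\mathbf{y}),
$$

so $f(\mathbf{y}^{(k+1)}) \ge f(\mathbf{y}^{(k)})$ and the objective values are monotonically non-decreasing, consistent with Theorem 2 below.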

(Convergence of CCCP) Theorem 2: Let $\{\mathbf{w}^{(k)}\}_{k=1}^{\infty}$ be any sequence generated by CCCP from any feasible initial point. Then all limiting points of $\{\mathbf{w}^{(k)}\}_{k=1}^{\infty}$ are stationary points of the difference-of-convex-functions program (DCP). In addition, $\lim_{k\to\infty} f(\mathbf{y}^{(k)}) = f(\mathbf{y}^{*})$, where $\mathbf{y}^{*}$ is some stationary point of the DCP; i.e., the sequence of objective function values converges.

Algo | Update Type    | Update Formula
PGD  | Additive       | $w_d^{(k+1)} \leftarrow P_{\mathbb{R}^{\epsilon}_{++}}\{w_d^{(k)} + \gamma \nabla_{w_d} f(\mathbf{w}^{(k)})\}$
EG   | Multiplicative | $w_d^{(k+1)} \leftarrow w_d^{(k)} \exp\{\gamma \nabla_{w_d} f(\mathbf{w}^{(k)})\}$
SMA  | Multiplicative | $w_d^{(k+1)} \leftarrow w_d^{(k)} \exp\{\gamma\, w_d^{(k)} \nabla_{w_d} f(\mathbf{w}^{(k)})\}$
CCCP | Multiplicative | $w_{ij}^{(k+1)} \propto w_{ij}^{(k)} \cdot \nabla_{v_i} f_S(\mathbf{w}^{(k)}) \cdot f_{v_j}(\mathbf{w}^{(k)})$
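A minimal sketch of the CCCP multiplicative update from the table, written for the illustrative Node class used earlier: an upward pass caches the node values $f_v(\mathbf{w})$, a downward pass computes the derivatives $\nabla_{v} f_S(\mathbf{w})$, and each sum-node weight is re-estimated proportionally to $w_{ij}\cdot\nabla_{v_i} f_S\cdot f_{v_j}$ with per-node normalization. Using a single observation in the linear domain is a simplification; the authors' implementation details (batching, log-domain arithmetic) may differ.

```python
# Sketch of one CCCP iteration for SPN weights (illustrative; assumes the
# Node class, `value`, and `root` from the earlier evaluation sketch).

def downward_derivatives(root_node, values):
    """Compute d f_S / d f_v for every node v by a top-down pass."""
    deriv = {id(root_node): 1.0}
    stack = [root_node]
    # A plain stack suffices here because only leaf nodes are shared in the
    # toy SPN; a general DAG needs a parents-first topological order.
    while stack:
        node = stack.pop()
        d = deriv.get(id(node), 0.0)
        if node.kind == 'sum':
            for w, child in zip(node.weights, node.children):
                deriv[id(child)] = deriv.get(id(child), 0.0) + d * w
                stack.append(child)
        elif node.kind == 'product':
            for child in node.children:
                others = 1.0
                for sibling in node.children:
                    if sibling is not child:
                        others *= values[id(sibling)]
                deriv[id(child)] = deriv.get(id(child), 0.0) + d * others
                stack.append(child)
    return deriv

def cccp_update(root_node, x):
    """One multiplicative step: w_ij <- w_ij * dF/dv_i * f_{v_j}, normalized."""
    values = {}
    def cache(node):                       # upward pass: cache f_v(x | w)
        for c in node.children:
            cache(c)
        values[id(node)] = value(node, x)
    cache(root_node)
    deriv = downward_derivatives(root_node, values)
    def update(node):                      # re-estimate weights per sum node
        if node.kind == 'sum':
            new = [w * deriv[id(node)] * values[id(c)]
                   for w, c in zip(node.weights, node.children)]
            z = sum(new) or 1.0
            node.weights = [nw / z for nw in new]
        for c in node.children:
            update(c)
    update(root_node)

cccp_update(root, {0: 1, 1: 0})
print(root.weights)   # mass moves toward the induced tree consistent with the data
```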

Experiments

[Figure 1: Negative log-likelihood values ($-\log \Pr(\mathbf{x}\mid\mathbf{w})$, y-axis) versus number of iterations (x-axis) for PGD, EG, SMA and CCCP on 20 benchmark datasets: NLTCS, MSNBC, KDD 2000, Plants, Audio, Jester, Netflix, Accidents, Retail, Pumsb Star, DNA, Kosarek, MSWeb, Book, EachMovie, WebKB, Reuters-52, 20 Newsgroup, BBC, and Ad.]

• CCCP consistently outperforms the other three algorithms.

• We suggest CCCP for maximum likelihood estimation, and CVB (collapsed variational Bayes) for Bayesian learning of SPNs (see our ICML 2016 paper).