
A Unified Approach for Learning the Parameters of Sum-Product Networks

Han Zhao†, Pascal Poupart⋆ and Geoff Gordon†
{han.zhao, ggordon}@cs.cmu.edu, [email protected]

†Machine Learning Department, Carnegie Mellon University
⋆David R. Cheriton School of Computer Science, University of Waterloo

Introduction

• We present a unified approach for learning the parameters of Sum-Product Networks (SPNs).

• We construct a more efficient factorization of complete and decomposable SPNs into a mixture of trees, with each tree being a product of univariate distributions.

• We show that the MLE problem for SPNs can be formulated as a signomial program.

• We construct two parameter learning algorithms for SPNs by using sequential monomial approximations (SMA) and the concave-convex procedure (CCCP). Both SMA and CCCP admit multiplicative weight updates.

• We prove the convergence of CCCP on SPNs.

Background

Sum-Product Networks (SPNs):

• Rooted directed acyclic graph of univariate distributions, sum nodes and product nodes.

• We focus on discrete SPNs, but the proposed algorithms work for continuous ones as well.

Recursive computation of the network:

$$
V_k(\mathbf{x} \mid \mathbf{w}) =
\begin{cases}
p(X_i = x_i) & k \text{ is a leaf node over } X_i \\
\prod_{j \in \mathrm{Ch}(k)} V_j(\mathbf{x} \mid \mathbf{w}) & k \text{ is a product node} \\
\sum_{j \in \mathrm{Ch}(k)} w_{kj}\, V_j(\mathbf{x} \mid \mathbf{w}) & k \text{ is a sum node}
\end{cases}
$$
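To make the recursion concrete, here is a minimal sketch of bottom-up SPN evaluation in Python. The Node class and the toy network below are our own illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of bottom-up SPN evaluation following the recursion above.
# The Node representation is an illustrative assumption.

class Node:
    def __init__(self, kind, children=None, weights=None, var=None, dist=None):
        self.kind = kind              # 'leaf', 'product', or 'sum'
        self.children = children or []
        self.weights = weights or []  # w_kj for sum nodes, aligned with children
        self.var = var                # index of X_i for leaf nodes
        self.dist = dist              # maps values of X_i to probabilities

def value(node, x):
    """Evaluate V_k(x | w) recursively."""
    if node.kind == 'leaf':
        return node.dist[x[node.var]]                       # p(X_i = x_i)
    if node.kind == 'product':
        v = 1.0
        for child in node.children:                         # product over children
            v *= value(child, x)
        return v
    return sum(w * value(child, x)                          # weighted sum
               for w, child in zip(node.weights, node.children))

# Toy SPN matching the mixture-of-trees illustration in the next section.
ix1  = Node('leaf', var=0, dist={1: 1.0, 0: 0.0})   # indicator I_{x1}
ix1n = Node('leaf', var=0, dist={1: 0.0, 0: 1.0})   # indicator I_{x̄1}
ix2  = Node('leaf', var=1, dist={1: 1.0, 0: 0.0})
ix2n = Node('leaf', var=1, dist={1: 0.0, 0: 1.0})
p1 = Node('product', [ix1, ix2])
p2 = Node('product', [ix1, ix2n])
p3 = Node('product', [ix1n, ix2n])
root = Node('sum', [p1, p2, p3], weights=[0.2, 0.5, 0.3])

print(value(root, {0: 1, 1: 0}))   # 0.5, i.e. w_2 for the tree I_{x1} I_{x̄2}
```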

Scope: The set of variables that have univariate distributions among the node's descendants.

Complete: An SPN is complete iff each sum node has children with the same scope.

Decomposable: An SPN is decomposable iff for every product node $v$, $\mathrm{scope}(v_i) \cap \mathrm{scope}(v_j) = \emptyset$ for all $v_i, v_j \in \mathrm{Ch}(v)$, $i \neq j$.
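A small sketch of how scope, completeness, and decomposability can be checked by recursive passes over the graph; it assumes the illustrative Node class and toy SPN from the evaluation sketch above.

```python
# Sketch of scope computation plus completeness/decomposability checks,
# assuming the illustrative Node class from the evaluation sketch above.

def scope(node):
    """Variables with univariate distributions among the node's descendants."""
    if node.kind == 'leaf':
        return {node.var}
    s = set()
    for child in node.children:
        s |= scope(child)
    return s

def is_complete(node):
    """Every sum node's children must share the same scope."""
    if node.kind == 'sum':
        scopes = [scope(c) for c in node.children]
        if any(s != scopes[0] for s in scopes):
            return False
    return all(is_complete(c) for c in node.children)

def is_decomposable(node):
    """Every product node's children must have pairwise disjoint scopes."""
    if node.kind == 'product':
        seen = set()
        for c in node.children:
            s = scope(c)
            if seen & s:
                return False
            seen |= s
    return all(is_decomposable(c) for c in node.children)

print(is_complete(root), is_decomposable(root))   # True True for the toy SPN
```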

A Unified Framework for Learning

(SPNs as a Mixture of Trees) Theorem 1: Every complete and decomposable SPN $S$ can be factorized into a sum of $\Omega(2^h)$ induced trees (sub-graphs), where each tree corresponds to a product of univariate distributions; $h$ is the height of $S$.

Example (a root sum node with weights $w_1, w_2, w_3$ over three product nodes of indicator leaves $I_{x_1}, I_{\bar{x}_1}, I_{x_2}, I_{\bar{x}_2}$):

$$
S(\mathbf{x}) \;=\; w_1\, I_{x_1} I_{x_2} \;+\; w_2\, I_{x_1} I_{\bar{x}_2} \;+\; w_3\, I_{\bar{x}_1} I_{\bar{x}_2},
$$

i.e., the SPN decomposes into a mixture of three induced trees, each a product of univariate distributions.
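A small sketch that enumerates the induced trees of the toy SPN above (pick one child at each sum node, keep all children at each product node) and checks numerically that their weighted sum reproduces the network value; it assumes the Node class and the `root`/`value` objects from the earlier evaluation sketch.

```python
# Sketch: enumerate induced trees of the toy SPN and verify Theorem 1 numerically.
from itertools import product as cartesian

def induced_trees(node):
    """Yield (monomial weight, leaf list) pairs, one per induced tree at `node`."""
    if node.kind == 'leaf':
        yield 1.0, [node]
    elif node.kind == 'sum':
        # Choose exactly one child per sum node; multiply in its edge weight.
        for w, child in zip(node.weights, node.children):
            for cw, leaves in induced_trees(child):
                yield w * cw, leaves
    else:  # product node: take all children and combine their sub-trees
        child_trees = [list(induced_trees(c)) for c in node.children]
        for combo in cartesian(*child_trees):
            w, leaves = 1.0, []
            for cw, ls in combo:
                w *= cw
                leaves = leaves + ls
            yield w, leaves

def tree_value(weight, leaves, x):
    v = weight
    for leaf in leaves:
        v *= leaf.dist[x[leaf.var]]
    return v

x = {0: 1, 1: 0}
trees = list(induced_trees(root))
print(len(trees))                                      # 3 induced trees
print(sum(tree_value(w, ls, x) for w, ls in trees))    # 0.5 == value(root, x)
```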

Maximum Likelihood Estimation as a Signomial Program: The MLE optimization is

$$
\begin{aligned}
\underset{\mathbf{w}}{\text{maximize}} \quad &
\frac{f_S(\mathbf{x}\mid\mathbf{w})}{f_S(\mathbf{1}\mid\mathbf{w})}
= \frac{\sum_{t=1}^{\tau} \prod_{n=1}^{N} I^{(t)}_{x_n} \prod_{d=1}^{D} w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}}
       {\sum_{t=1}^{\tau} \prod_{d=1}^{D} w_d^{\,\mathbb{I}[w_d \in \mathcal{T}_t]}} \\
\text{subject to} \quad & \mathbf{w} \in \mathbb{R}^{D}_{++},
\end{aligned}
$$

where $\tau = \Omega(2^h)$, $D$ is the number of parameters in $S$, $N$ is the number of random variables modeled by $S$, and $\mathcal{T}_t$ is the $t$-th induced tree.

Proposition 2: The MLE problem for SPNs is a signomial program.
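As a concrete instance (our own worked example, not from the poster): for the toy two-variable SPN above and a single observation $x_1 = 1$, $x_2 = 0$, only the second induced tree is consistent with the evidence, so the objective reduces to

$$
\frac{f_S(\mathbf{x}\mid\mathbf{w})}{f_S(\mathbf{1}\mid\mathbf{w})} = \frac{w_2}{w_1 + w_2 + w_3},
$$

a ratio of posynomials whose maximum pushes all the mass onto $w_2$; with many observations the numerator becomes a product of such sums, giving the general signomial program of Proposition 2.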

Logarithmic transformation ($y_d = \log w_d$) leads to a difference of convex functions:

$$
\underset{\mathbf{y}}{\text{maximize}} \;\;
\log \sum_{t=1}^{\tau(\mathbf{x})} \exp\!\Big(\sum_{d=1}^{D} y_d\, \mathbb{I}[y_d \in \mathcal{T}_t]\Big)
\;-\;
\log \sum_{t=1}^{\tau} \exp\!\Big(\sum_{d=1}^{D} y_d\, \mathbb{I}[y_d \in \mathcal{T}_t]\Big),
$$

where $\tau(\mathbf{x})$ is the number of induced trees consistent with the evidence $\mathbf{x}$.

Sequential Monomial Approximation (SMA): Optimal linear approximation in log-space, which corresponds to the optimal monomial function approximation to the original signomial.

Concave-Convex Procedure (CCCP): Sequential convex relaxation by linearizing the first term, with an efficient $O(|S|)$ closed-form solver for each convex sub-problem.
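To make the relaxation explicit (a standard CCCP step, sketched here under the assumption that it matches the poster's setup): write the objective as $f(\mathbf{y}) = f_1(\mathbf{y}) - f_2(\mathbf{y})$, where both terms are convex log-sum-exp functions. Linearizing the convex first term at the current iterate gives a concave lower bound on $f$ that is tight at $\mathbf{y}^{(k)}$, and maximizing it yields the next iterate:

$$
\mathbf{y}^{(k+1)} \in \arg\max_{\mathbf{y}} \;
f_1(\mathbf{y}^{(k)}) + \nabla f_1(\mathbf{y}^{(k)})^{\top}\!\big(\mathbf{y} - \mathbf{y}^{(k)}\big) - f_2(\mathbf{y}),
$$

so $f(\mathbf{y}^{(k+1)}) \ge f(\mathbf{y}^{(k)})$ and the objective values are monotonically non-decreasing, consistent with Theorem 2 below.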

(Convergence of CCCP) Theorem 2: Let $\{\mathbf{w}^{(k)}\}_{k=1}^{\infty}$ be any sequence generated by CCCP from any feasible initial point. Then all limiting points of $\{\mathbf{w}^{(k)}\}_{k=1}^{\infty}$ are stationary points of the difference-of-convex-functions program (DCP). In addition, $\lim_{k\to\infty} f(\mathbf{y}^{(k)}) = f(\mathbf{y}^{*})$, where $\mathbf{y}^{*}$ is some stationary point of the DCP; i.e., the sequence of objective function values converges.

Algo | Update Type    | Update Formula
PGD  | Additive       | $w_d^{(k+1)} \leftarrow P_{\mathbb{R}^{\epsilon}_{++}}\{w_d^{(k)} + \gamma \nabla_{w_d} f(\mathbf{w}^{(k)})\}$
EG   | Multiplicative | $w_d^{(k+1)} \leftarrow w_d^{(k)} \exp\{\gamma \nabla_{w_d} f(\mathbf{w}^{(k)})\}$
SMA  | Multiplicative | $w_d^{(k+1)} \leftarrow w_d^{(k)} \exp\{\gamma\, w_d^{(k)} \nabla_{w_d} f(\mathbf{w}^{(k)})\}$
CCCP | Multiplicative | $w_{ij}^{(k+1)} \propto w_{ij}^{(k)} \cdot \nabla_{v_i} f_S(\mathbf{w}^{(k)}) \cdot f_{v_j}(\mathbf{w}^{(k)})$
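A minimal sketch of the CCCP multiplicative update from the table, written for the illustrative Node class used earlier: an upward pass caches the node values $f_v(\mathbf{w})$, a downward pass computes the derivatives $\nabla_{v} f_S(\mathbf{w})$, and each sum-node weight is re-estimated proportionally to $w_{ij}\cdot\nabla_{v_i} f_S\cdot f_{v_j}$ with per-node normalization. Using a single observation in the linear domain is a simplification; the authors' implementation details (batching, log-domain arithmetic) may differ.

```python
# Sketch of one CCCP iteration for SPN weights (illustrative; assumes the
# Node class, `value`, and `root` from the earlier evaluation sketch).

def downward_derivatives(root_node, values):
    """Compute d f_S / d f_v for every node v by a top-down pass."""
    deriv = {id(root_node): 1.0}
    stack = [root_node]
    # A plain stack suffices here because only leaf nodes are shared in the
    # toy SPN; a general DAG needs a parents-first topological order.
    while stack:
        node = stack.pop()
        d = deriv.get(id(node), 0.0)
        if node.kind == 'sum':
            for w, child in zip(node.weights, node.children):
                deriv[id(child)] = deriv.get(id(child), 0.0) + d * w
                stack.append(child)
        elif node.kind == 'product':
            for child in node.children:
                others = 1.0
                for sibling in node.children:
                    if sibling is not child:
                        others *= values[id(sibling)]
                deriv[id(child)] = deriv.get(id(child), 0.0) + d * others
                stack.append(child)
    return deriv

def cccp_update(root_node, x):
    """One multiplicative step: w_ij <- w_ij * dF/dv_i * f_{v_j}, normalized."""
    values = {}
    def cache(node):                       # upward pass: cache f_v(x | w)
        for c in node.children:
            cache(c)
        values[id(node)] = value(node, x)
    cache(root_node)
    deriv = downward_derivatives(root_node, values)
    def update(node):                      # re-estimate weights per sum node
        if node.kind == 'sum':
            new = [w * deriv[id(node)] * values[id(c)]
                   for w, c in zip(node.weights, node.children)]
            z = sum(new) or 1.0
            node.weights = [nw / z for nw in new]
        for c in node.children:
            update(c)
    update(root_node)

cccp_update(root, {0: 1, 1: 0})
print(root.weights)   # mass moves toward the induced tree consistent with the data
```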

Experiments

[Figure 1: Negative log-likelihood values ($-\log \Pr(\mathbf{x}\mid\mathbf{w})$, y-axis) versus number of iterations (x-axis) for PGD, EG, SMA and CCCP on 20 benchmark datasets: NLTCS, MSNBC, KDD 2000, Plants, Audio, Jester, Netflix, Accidents, Retail, Pumsb Star, DNA, Kosarek, MSWeb, Book, EachMovie, WebKB, Reuters-52, 20 Newsgroup, BBC, and Ad.]

• CCCP consistently outperforms the other three algorithms.

• We suggest CCCP for maximum likelihood estimation, and CVB (collapsed variational Bayes) for Bayesian learning of SPNs (see our ICML 2016 paper).