
Expectation-Maximization for Learning Determinantal Point Processes

NIPS 2014 poster: jgillenw.com/nips2014-poster.pdf



TASK: SUBSET SELECTION

Two goals: relevance and diversity

Quality learning: learn a weight for each row of K. Kulesza and Taskar (UAI 2011)

Code + Data: https://code.google.com/p/em-for-dpps

Relevance only: ... (figure of an example item set)

Relevance + diversity: ... (figure of an example item set)

SIMILARITY KERNEL

DETERMINANTAL POINT PROCESS

DPP INFERENCE

Exact and efficient inference, $O(N^3)$ or faster:
• marginalization: $P(Y \subseteq \mathbf{Y}) = \det(K_Y)$
• normalized probabilities: $P(\mathbf{Y} = Y) = |\det(K - I_{\bar{Y}})|$
• conditioning: $P(\mathbf{Y} = Y \mid A \subseteq \mathbf{Y})$, $P(\mathbf{Y} = Y \mid A \cap \mathbf{Y} = \emptyset)$
• sampling: $Y \sim \mathrm{DPP}(K)$

But: DPP maximum-likelihood learning is (conjectured to be) NP-hard.
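The marginalization rule translates directly into code. A minimal numpy sketch (the function name is mine, not the released implementation):

```python
import numpy as np

def marginal_prob(K, A):
    """P(A subseteq Y) = det(K_A) under the marginal kernel K (a sketch)."""
    A = list(A)
    return np.linalg.det(K[np.ix_(A, A)])
```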

DPP LOG-LIKELIHOOD

EXPECTATION-MAXIMIZATION

Ground set of N items: $\{1, \ldots, N\}$

Training data (example subsets): $Y_1, \ldots, Y_T$


PRIOR WORK

Sum-of-experts model: weighted sum of fixed DPPs

Kulesza and Taskar (ICML 2011)

Parametric kernel: K assumed Gaussian, polynomial, etc.

Affandi et al. (ICML 2014)

This work: no fixed values or restrictive parameterizations for K

LOG-LIKELIHOOD IS NON-CONCAVE

BASELINE: PROJECTED GRADIENT

Training data: (figure of example subsets)

PRODUCT RECOMMENDATION DATASET

LOWER-BOUNDING

COORDINATE ASCENT

E-STEP

M-STEP

LOG-LIKELIHOOD RESULTS

EX: “SAFETY” CATEGORY

VTech Comm. Audio Monitor

Infant Optics Video Monitor

Graco Sweet Slumber Sound

Boppy Noggin Nest Head Support

Cloud b Twilight Night Light

Braun ThermoScan Lens Filters

Britax EZ-Cling Sun Shades

TL Care Organic Cotton Mittens

Regalo Easy Step Walk Thru Gate

Aquatopia Bath Thermometer Alarm

~30,000 Amazon baby registries, 13 product categories

Categories include furniture, carseats, toys, ...

Left: a 10-item registry for the “safety” category under the model learned by projected gradient. Right: the EM model replaces two redundant items (crossed out: Motorola Video Monitor, Summer Infant Video Monitor) with two more diverse items.

Goal: learn a DPP for each category

RUNTIME RESULTS

NAIVE INITIALIZATION

Draw the starting point from a Wishart distribution with N degrees of freedom.

MOMENTS-MATCHING INITIALIZATION

Data-dependent starting point:

$$m_i = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}(i \in Y_t), \qquad m_{ij} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}(i \in Y_t \wedge j \in Y_t)$$

$$\tilde{K}_{ii} = m_i, \qquad \tilde{K}_{ij} = \sqrt{\max\left(\tilde{K}_{ii} \tilde{K}_{jj} - m_{ij},\ 0\right)}$$
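Both initializations are easy to sketch in numpy. The moments-matching formulas are exactly those above; the rescaling of the Wishart draw into [0, 1] is an assumption, since the poster only specifies the degrees of freedom:

```python
import numpy as np

def naive_init(N, rng=None):
    """Naive start: a Wishart(N, I) draw, rescaled (an assumed choice) so
    all eigenvalues land in [0, 1]."""
    rng = rng or np.random.default_rng()
    X = rng.normal(size=(N, N))
    W = X @ X.T                                   # Wishart with N degrees of freedom
    lam, V = np.linalg.eigh(W)
    return V @ np.diag(lam / (lam.max() + 1e-12)) @ V.T

def moments_matching_init(Ys, N):
    """Data-dependent start from the singleton/pair frequencies above."""
    T = len(Ys)
    m = np.zeros(N)                               # m_i
    M = np.zeros((N, N))                          # m_ij
    for Y in Ys:
        idx = list(Y)
        m[idx] += 1.0
        M[np.ix_(idx, idx)] += 1.0
    m, M = m / T, M / T
    K = np.sqrt(np.maximum(np.outer(m, m) - M, 0.0))   # K_ij for i != j
    np.fill_diagonal(K, m)                             # K_ii = m_i
    return K
```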

T1 = iterations to converge

T2 = iterations to find a step size

T = # of example sets

N = ground set size

Low-data setting: $T = 2N$ example sets. Using all data, EM and the baseline are comparable.

GENERATIVE DPP SAMPLING

Step 1: Eigendecompose $K = V \Lambda V^\top$.
Step 2: Sample “hidden” eigenvectors $J$: include each index $j$ independently with probability $\lambda_j$.
Step 3: Form the elementary DPP kernel $K^J = V^J \mathrm{diag}(\mathbf{1}) (V^J)^\top$.
Step 4: Sample a subset based on $J$: $Y \sim \mathrm{DPP}(K^J)$.

(Figure: eigenvalues $\lambda_1, \ldots, \lambda_6$ and eigenvectors $v_1, \ldots, v_6$, with the sampled $J$ highlighted over eight panels.)
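A minimal numpy sketch of these four steps (my own illustration of the standard two-phase sampler, not the authors' released code):

```python
import numpy as np

def sample_dpp(K, rng=None):
    """Sample Y ~ DPP(K) for a marginal kernel K (symmetric, eigenvalues in [0, 1])."""
    rng = rng or np.random.default_rng()
    # Step 1: eigendecompose K = V diag(lambda) V^T.
    lam, V = np.linalg.eigh(K)
    # Step 2: keep eigenvector j independently with probability lambda_j.
    Vj = V[:, rng.random(len(lam)) < lam]
    # Steps 3-4: sample from the elementary DPP with kernel K^J = V^J (V^J)^T.
    Y = []
    while Vj.shape[1] > 0:
        # Pick item i with probability ||row i of Vj||^2 / |J|.
        p = np.sum(Vj ** 2, axis=1)
        p /= p.sum()
        i = int(rng.choice(len(p), p=p))
        Y.append(i)
        # Restrict span(Vj) to the subspace orthogonal to e_i: the SVD of
        # row i gives an orthonormal basis of R^|J| whose first vector is
        # parallel to that row; dropping it removes one dimension.
        _, _, Vt = np.linalg.svd(Vj[i].reshape(1, -1))
        Vj = Vj @ Vt[1:].T
    return sorted(Y)
```

Usage: `Y = sample_dpp(K)`; each call returns one sampled subset, whose size equals the number of retained eigenvectors.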

Main idea: Exploit hidden variable J to develop EM-style optimization

$$\mathcal{L}(K) = \mathcal{L}(V, \Lambda) = \sum_{t=1}^{T} \log p_K(Y_t) = \sum_{t=1}^{T} \log\left(\sum_J p_K(J, Y_t)\right) = \sum_{t=1}^{T} \log\left(\sum_J q(J \mid Y_t)\, \frac{p_K(J, Y_t)}{q(J \mid Y_t)}\right)$$

$$\geq \sum_{t=1}^{T} \sum_J q(J \mid Y_t) \log\left(\frac{p_K(J, Y_t)}{q(J \mid Y_t)}\right) \equiv F(q, V, \Lambda),$$

where the inequality is Jensen's.

E-step: $\displaystyle \max_q F = \min_q \sum_{t=1}^{T} \mathrm{KL}\big(q(J \mid Y_t) \,\|\, p_K(J \mid Y_t)\big)$

M-step: $\displaystyle \max_{V, \Lambda} F = \max_{V, \Lambda} \sum_{t=1}^{T} \mathbb{E}_q[\log p_K(J, Y_t)]$ subject to $0 \preceq \Lambda \preceq I$, $V^\top V = I$

Set distributions equal to minimize KL:

Theorem: If $p_K(Y \mid J)$ is a DPP distribution compactly parameterized by $K$, then $p_K(J \mid Y)$ is also a DPP distribution with a related compact parameterization, computable in at most $O(N^3)$ time.

$$q(J \mid Y_t) = p_K(J \mid Y_t)$$

The eigenvalue update is closed-form. Constraints are implicitly enforced, since marginals of q are in [0,1]. (No need for a projection step.)

Because q is a DPP, the marginalization required for this update is efficient.

$O(TNk^2)$, for $k = \max_t |Y_t|$

The eigenvector update is more complicated, but still efficient. The constraints correspond to optimization over the Stiefel manifold, and Edelman et al. (SIMAX 1998) show that the update below stays on the manifold.

(Figure: an example similarity kernel over products including the Infant Optics Video Monitor, Motorola Video Monitor, Summer Infant Video Monitor, Aquatopia Bath Thermometer Alarm, and Regalo Easy Step Walk Thru Gate; similar items get large off-diagonal entries.)

$$K = \begin{pmatrix} 0.5 & 0.38 & 0.0 \\ 0.38 & 0.4 & 0.0 \\ 0.0 & 0.0 & 0.3 \end{pmatrix}$$

$$P(\mathbf{Y} = Y) = |\det(K - I_{\bar{Y}})|$$

Example: product recommendation

$$\mathbf{Y} \sim \mathrm{DPP}(K) \implies P(Y \subseteq \mathbf{Y}) = \det(K_Y)$$

$$\mathcal{L}(K) = \sum_{t=1}^{T} \log\big(|\det(K - I_{\bar{Y}_t})|\big)$$

(Figure: the log-likelihood of $K(\eta) = \begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 0.4 \end{pmatrix} + \eta \begin{pmatrix} 0.1 & 0.6 \\ 0.6 & -0.1 \end{pmatrix}$, plotted over $\eta \in [0, 0.5]$, is non-concave.)

(Figure: a gradient step $K^{(0)} + \eta\, \partial \mathcal{L}(K^{(0)}) / \partial K$ can leave the feasible set; clipping the eigenvalues $\lambda_1, \lambda_2$ back to $[0, 1]$ gives $K^{(1)}$.)


Gradient: $\displaystyle \frac{\partial \mathcal{L}(K)}{\partial K} = \sum_{t=1}^{T} (K - I_{\bar{Y}_t})^{-1}$

Project to satisfy the constraints on the eigenvalues of $K$: $0 \preceq \Lambda \preceq I$.
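One baseline iteration combines the gradient and the projection. A minimal numpy sketch (the fixed step size is a simplification; the poster's line search, the source of the $T_2$ factor, is omitted):

```python
import numpy as np

def projgrad_step(K, Ys, eta):
    """One step of the projected-gradient baseline (a sketch)."""
    N = K.shape[0]
    grad = np.zeros_like(K)
    for Y in Ys:
        # I_{Ybar}: identity on the complement of Y.
        I_bar = np.diag([0.0 if i in Y else 1.0 for i in range(N)])
        grad += np.linalg.inv(K - I_bar)      # d log|det(K - I_Ybar)| / dK
    K_step = K + eta * grad
    # Project: eigendecompose and clip the eigenvalues to [0, 1].
    lam, V = np.linalg.eigh((K_step + K_step.T) / 2)
    return V @ np.diag(np.clip(lam, 0.0, 1.0)) @ V.T
```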

Bar chart (transcribed), relative log-likelihood difference 100(EM - projgrad) / |projgrad|, by category: safety 0.0, furniture 0.0, carseats 0.5, strollers 0.9, health 1.3, bath 1.8, media 2.4, toys 2.5, bedding 2.5, apparel 7.7, diaper 8.1, gear 9.8, feeding 11.0.

A second bar chart (transcribed) reports the same quantity, 100(EM - projgrad) / |projgrad|, by category: safety 0.6, furniture 2.6, carseats -0.1, strollers 1.5, health 3.1, bath 2.3, media 1.9, toys 3.5, bedding 5.3, apparel 5.3, diaper 5.8, gear 10.4, feeding 16.5.

Bar chart (transcribed), speedup factor projgrad runtime / EM runtime, by category: feeding 0.4, gear 0.5, bedding 0.6, bath 0.7, apparel 0.7, diaper 1.1, media 1.3, furniture 1.3, health 1.9, toys 2.2, safety 4.0, carseats 6.0, strollers 7.4.

$$P(Y) = |\det(K - I_{\bar{Y}})| = \sum_{J: J \subseteq \{1, \ldots, N\}} P_{K^J}(Y) \prod_{j: j \in J} \lambda_j \prod_{j: j \notin J} (1 - \lambda_j) = \sum_{J: J \subseteq \{1, \ldots, N\}} p_K(Y \mid J)\, p_K(J)$$
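For a small ground set the decomposition can be checked by brute force. A sketch that enumerates all $2^N$ index sets $J$ (the random kernel and test subset are my own choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 3
A = rng.normal(size=(N, N))
K = A @ A.T
K /= np.linalg.eigvalsh(K).max() + 1.0        # push eigenvalues into (0, 1)
lam, V = np.linalg.eigh(K)

def atomic_prob(Kmat, Y):
    """P(Y = Y) = |det(K - I_Ybar)| for a marginal kernel."""
    I_bar = np.diag([0.0 if i in Y else 1.0 for i in range(N)])
    return abs(np.linalg.det(Kmat - I_bar))

Y = {0, 2}
lhs = atomic_prob(K, Y)
rhs = 0.0
for r in range(N + 1):
    for J in itertools.combinations(range(N), r):
        J = list(J)
        KJ = V[:, J] @ V[:, J].T              # elementary kernel K^J
        pJ = np.prod(lam[J]) * np.prod(1.0 - np.delete(lam, J))
        rhs += atomic_prob(KJ, Y) * pJ
print(lhs, rhs)                                # agree up to floating-point error
```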

EM: $O(T_1 T_2 N^3 + T_1 T N k^2)$
Projected gradient: $O(T_1 T_2 N^3 + T_1 T N^3)$

$$\lambda_j = \frac{1}{T} \sum_{t=1}^{T} \sum_{J: j \in J} q(J \mid Y_t) = \frac{1}{T} \sum_{t=1}^{T} q(j \in J \mid Y_t)$$

$F$ is concave in $q$ and concave in $\Lambda$, but not concave in $V$.

$$V' \leftarrow V \exp\left[\eta \left(V^\top \frac{\partial \mathcal{L}}{\partial V} - \left(\frac{\partial \mathcal{L}}{\partial V}\right)^{\!\top} V\right)\right]$$
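A direct transcription of this update (a sketch; scipy's `expm` computes the matrix exponential, and the exponent is skew-symmetric by construction, which is why the iterate stays orthogonal):

```python
import numpy as np
from scipy.linalg import expm

def eigenvector_step(V, dL_dV, eta):
    """One multiplicative eigenvector update (Edelman et al., SIMAX 1998)."""
    A = V.T @ dL_dV - dL_dV.T @ V   # skew-symmetric: A^T = -A
    return V @ expm(eta * A)        # expm(A) is orthogonal, so V' V'^T = I
```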

EIGENVALUES

EIGENVECTORS

$$K = V \Lambda V^\top, \qquad K_Y = [V \Lambda V^\top]_Y = V^Y \Lambda (V^Y)^\top \implies q(J \mid Y) \propto \det\left(\left[(V^Y)^\top V^Y \operatorname{diag}\left(\frac{\lambda}{1 - \lambda}\right)\right]_J\right)$$
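Combining this conditional form with the closed-form eigenvalue update gives a compact E-step/M-step sketch. This naive version inverts an $N \times N$ matrix per example set, so it runs in $O(TN^3)$ rather than the $O(TNk^2)$ the paper achieves, and it assumes $0 < \lambda_j < 1$ so the diagonal scaling is finite:

```python
import numpy as np

def estep_marginals(V, lam, Y):
    """Singleton marginals q(j in J | Y), assuming 0 < lam_j < 1.

    q(J | Y) is determinantal with kernel
    L = (V^Y)^T V^Y diag(lam / (1 - lam)); its singleton marginals are the
    diagonal of L (L + I)^{-1}.
    """
    N = len(lam)
    VY = V[list(Y), :]                        # rows of V restricted to Y
    L = (VY.T @ VY) * (lam / (1.0 - lam))     # scale column j by lam_j/(1-lam_j)
    Q = L @ np.linalg.inv(L + np.eye(N))      # marginal kernel of q(. | Y)
    return np.diag(Q)

def eigenvalue_update(V, lam, Ys):
    """Closed-form M-step: lambda_j = (1/T) sum_t q(j in J | Y_t)."""
    return np.mean([estep_marginals(V, lam, Y) for Y in Ys], axis=0)
```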

(Figure: the crossed-out items are the Motorola Video Monitor and Summer Infant Video Monitor.)

$$\mathbf{Y} \sim \mathrm{DPP}(K)$$

$$P(\{1, 2\} \subseteq \mathbf{Y}) = \det(K_{\{1,2\}}) = \det\begin{pmatrix} 0.5 & 0.38 \\ 0.38 & 0.4 \end{pmatrix} = 0.05$$

$$P(\{1, 3\} \subseteq \mathbf{Y}) = \det(K_{\{1,3\}}) = \det\begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 0.3 \end{pmatrix} = 0.15$$

$$P(\{1\} \subseteq \mathbf{Y}) = \det(K_{\{1\}}) = K_{11} = 0.5$$

Similar items (1 and 2) are less likely to co-occur than dissimilar items (1 and 3), even though each has comparable individual quality.

Constraints: $0 \preceq K \preceq I$
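A three-line check of the displayed values (note the poster truncates the first determinant, which is 0.0556):

```python
import numpy as np

K = np.array([[0.5, 0.38, 0.0],
              [0.38, 0.4, 0.0],
              [0.0,  0.0, 0.3]])
print(np.linalg.det(K[np.ix_([0, 1], [0, 1])]))   # 0.0556, shown as 0.05
print(np.linalg.det(K[np.ix_([0, 2], [0, 2])]))   # 0.15
print(K[0, 0])                                    # 0.5
```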