
Latent Variable Models - University of Pittsburgh



CS 3750 Advanced Machine Learning, Lecture 8

Latent Variable Models

Tianyi Cui

[email protected]

Sept 20, 2018


Outline

• Latent Variable Models
• Probabilistic Principal Component Analysis
  • Model Formulation
  • Maximum Likelihood Estimation
  • Expectation Maximization Algorithm for pPCA
• Cooperative Vector Quantizer Model
  • Model Formulation
  • Variational Bayesian Learning for CVQ
• Noisy-OR Component Analyzer
  • Model Formulation
  • Variational EM for NOCA
• References


Latent Variable Models

Name   | High School Grade | University Grade | IQ Score | Phone Interview | Onsite Interview
John   | 4.0               | 4.0              | 120      | 3/4             | ?
Helen  | 3.2               | N/A              | 112      | 2/4             | ?
Sophia | 3.5               | 3.6              | N/A      | 4/4             | 85/100
Jack   | 3.6               | N/A              | N/A      | 3/4             | ?

(Figure: graphical model over the observed variables only: IQ score, GPA, High School Grade, Onsite Interview, Phone Interview.)


Latent Variable Models

The same data, now explained by a single latent variable:

(Figure: the same graphical model with a latent Intelligence node added that explains the observed variables IQ score, GPA, High School Grade, Onsite Interview, and Phone Interview.)


Latent Variable Models

Pros:
• Dimension reduction
• Explain correlations among the observed data at the level of the latent variables
• Allow inferring the state of the latent variables

Cons:
• Hard to work with (exact inference and learning can be intractable)

(Figure: two-layer graphical model with latent variables s_1, s_2, …, s_q on top and observed variables x_1, x_2, …, x_{d−1}, x_d below.)

Observed variables x: d dimensions
Latent variables s: q dimensions, with q < d

Probabilistic Principal Component Analysis


Review: Principal Component Analysis

• Project the data onto a low-dimensional space
• Maximize the variance of the projected data

Limitations:
• Non-parametric (no probabilistic model of the data)
• Computing the covariance matrix is computationally intensive
• No principled way of handling missing data
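As a concrete reference point before the probabilistic treatment, here is a minimal NumPy sketch of standard PCA via an eigendecomposition of the sample covariance matrix; the function name, dimensions, and toy data are illustrative, not part of the original slides.

```python
import numpy as np

def pca_project(X, q):
    """Project N x d data onto its top-q principal components (standard, non-probabilistic PCA)."""
    mu = X.mean(axis=0)
    Xc = X - mu                               # center the data
    S = Xc.T @ Xc / X.shape[0]                # d x d sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues come out in ascending order
    U_q = eigvecs[:, ::-1][:, :q]             # top-q principal eigenvectors
    return Xc @ U_q                           # N x q projection with maximal variance

# Toy usage: reduce 5-dimensional data to 2 principal components
X = np.random.randn(200, 5)
Z = pca_project(X, q=2)
```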



Probabilistic Principal Component Analysis

• A generative model for PCA
• Takes advantage of probability distributions in the computations
• The observed variables become independent given the latent variables
• Standard PCA is recovered in the limit σ² → 0

(Figure: two-layer graphical model with latent variables s_1, s_2, …, s_q and observed variables x_1, x_2, …, x_{d−1}, x_d.)

Latent variables s: q dimensions, q < d
  s ~ N(0, I)
Observed variables x: d dimensions
  x = Ws + μ + ε ~ N(μ, C_x), with μ a non-zero mean
  ε ~ N(0, σ²I)
  C_x = WW^T + σ²I
  x | s ~ N(Ws + μ, σ²I)


Factor analysis

• Latent variable model with a linear relationship:

  x = Ws + μ + ε

• W is a d × q matrix that relates the observed variables x to the latent variables s
• Latent variables: s ~ N(0, I)
• Error (or noise): ε ~ N(0, Ψ), Gaussian noise with diagonal covariance Ψ
• Location term (mean): μ

Then: x ~ N(μ, C_x)
• where C_x = WW^T + Ψ is the covariance matrix of the observed variables x
• the model's parameters W, μ, and Ψ can be found by maximum likelihood estimation


Probabilistic PCA (PPCA)

• A special case of the factor analysis model
• Noise variances constrained to be equal (ψ_i = σ²)

  x = Ws + μ + ε

• Latent variables: s ~ N(0, I)
• Error (or noise): ε ~ N(0, σ²I) (isotropic noise model)
• Location term (mean): μ

• x | s ~ N(Ws + μ, σ²I)
• x ~ N(μ, C_x)
• where C_x = WW^T + σ²I is the covariance matrix of x

• Standard PCA is a limiting case of probabilistic PCA, obtained as the noise covariance becomes infinitesimally small (Ψ = lim_{σ²→0} σ²I)


PPCA (Maximum likelihood PCA)

• Log-likelihood under the Gaussian noise model:

  L = −(N/2) { d ln(2π) + ln|C_x| + tr(C_x⁻¹ S) },   where C_x = WW^T + σ²I

  (tr(A) denotes the sum of the diagonal elements of A)

• Maximum likelihood estimates for the above:
  • μ: the mean of the data
  • S: the sample covariance matrix of the observations X,

    S = (1/N) Σ_{n=1}^{N} (x_n − μ)(x_n − μ)^T

• MLEs for W and σ² can be obtained in two ways:
  • in closed form (Tipping and Bishop)
  • with the EM algorithm (Roweis)
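A small sketch that evaluates this log-likelihood for given parameters (the function and variable names are ours, not from the slides); it simply plugs S and C_x = WW^T + σ²I into the expression above.

```python
import numpy as np

def ppca_log_likelihood(X, W, mu, sigma2):
    """L = -(N/2) [ d ln(2 pi) + ln|C_x| + tr(C_x^{-1} S) ], with C_x = W W^T + sigma^2 I."""
    N, d = X.shape
    Xc = X - mu
    S = Xc.T @ Xc / N                          # sample covariance of the observations
    C = W @ W.T + sigma2 * np.eye(d)
    _, logdet = np.linalg.slogdet(C)           # ln|C_x|
    return -0.5 * N * (d * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))
```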


Probabilistic PCA

The likelihood is maximized when

  W_ML = U_q (Λ_q − σ²I)^{1/2} R

• U_q is the d × q matrix whose q column vectors are the principal eigenvectors of S.
• Λ_q is the q × q diagonal matrix with the corresponding eigenvalues along the diagonal.
• R is an arbitrary q × q orthogonal rotation matrix.

• The maximum likelihood estimate of σ² is

  σ²_ML = (1 / (d − q)) Σ_{j=q+1}^{d} λ_j

• To find the most likely model given S, estimate σ²_ML and then W_ML with R = I, or employ the EM algorithm.
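A sketch of these closed-form estimates with the rotation fixed to R = I; the function name and the eigendecomposition details are our own, the formulas are the ones above.

```python
import numpy as np

def ppca_ml(X, q):
    """Closed-form ML estimates of mu, W, sigma^2 for probabilistic PCA (R = I)."""
    N, d = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]                 # eigenvalues in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    sigma2 = eigvals[q:].mean()                       # sigma^2_ML: average of the d - q discarded eigenvalues
    U_q = eigvecs[:, :q]                              # principal eigenvectors of S
    W = U_q @ np.diag(np.sqrt(eigvals[:q] - sigma2))  # W_ML = U_q (Lambda_q - sigma^2 I)^{1/2}
    return W, mu, sigma2
```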


Review: General Expectation Maximization (EM)

The key idea of the method: compute the parameter estimates iteratively by performing the following two steps.

Two steps of EM:

1. Expectation step. For all hidden and missing variables (and their possible value assignments), calculate their expectations under the current set of parameters Θ'.

2. Maximization step. Compute the new estimate of Θ by considering the expectations over the different value completions.

Stop when no further improvement is possible.

(Borrowed from the CS2750 lecture slides.)


EM for Probabilistic PCA

Parameters: μ (mean), W (weight matrix), σ² (noise variance)

• ML PCA is computationally heavy for high-dimensional data or large datasets.

• Complete-data likelihood for one example:

  p(x_n, s_n) = (2πσ²)^{−d/2} exp( −‖x_n − Ws_n − μ‖² / (2σ²) ) · (2π)^{−q/2} exp( −‖s_n‖² / 2 )

• Expected complete-data log-likelihood:

  E[ log p(X, S | μ, W, σ²) ] = −Σ_{n=1}^{N} { (d/2) log(2πσ²) + (1/2) Tr(E[s_n s_n^T]) + (1/(2σ²)) ‖x_n − μ‖²
      − (1/σ²) E[s_n]^T W^T (x_n − μ) + (1/(2σ²)) Tr(E[s_n s_n^T] W^T W) }


EM for Probabilistic PCA

Parameters: μ (mean), W (weight matrix), σ² (noise variance)

• E-step: derive E[s_n] and E[s_n s_n^T] from p(s_n | x_n, θ) using the current parameters.
  The posterior p(s_n | x_n, θ) = p(x_n | s_n, θ) P(s_n | θ) / Z is Gaussian, and its moments are

  E[s_n] = M⁻¹ W^T (x_n − μ)
  E[s_n s_n^T] = σ² M⁻¹ + E[s_n] E[s_n]^T
  M = W^T W + σ² I

• M-step: maximize the expected complete-data log-likelihood:

  W̃ = [ Σ_{n=1}^{N} (x_n − μ) E[s_n]^T ] [ Σ_{n=1}^{N} E[s_n s_n^T] ]⁻¹

  σ̃² = (1/(Nd)) Σ_{n=1}^{N} { ‖x_n − μ‖² − 2 E[s_n]^T W̃^T (x_n − μ) + Tr(E[s_n s_n^T] W̃^T W̃) }
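A compact sketch of this EM loop in NumPy; the initialization, iteration count, and names are our own choices, while the update formulas are the ones above, written in vectorized form with E[s_n] stored as the rows of Es.

```python
import numpy as np

def ppca_em(X, q, n_iter=100, seed=0):
    """EM for probabilistic PCA following the updates above (a sketch with arbitrary initialization)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W, sigma2 = rng.normal(size=(d, q)), 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        M = W.T @ W + sigma2 * np.eye(q)                 # M = W^T W + sigma^2 I
        Minv = np.linalg.inv(M)
        Es = Xc @ W @ Minv.T                             # rows are E[s_n] = M^{-1} W^T (x_n - mu)
        EssT = N * sigma2 * Minv + Es.T @ Es             # sum_n E[s_n s_n^T]
        # M-step: new W and sigma^2
        W_new = (Xc.T @ Es) @ np.linalg.inv(EssT)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Es * (Xc @ W_new))
                  + np.trace(EssT @ W_new.T @ W_new)) / (N * d)
        W = W_new
    return W, mu, sigma2
```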


Advantages of using the EM algorithm in probabilistic PCA models

• Convergence:
  • Tipping and Bishop (1997) showed that the only stable local extremum is the global maximum, at which the true principal subspace is found.
• Complexity:
  • Methods that explicitly compute the sample covariance matrix have complexity O(nd²).
  • The EM algorithm does not require computing the sample covariance matrix: O(dnq) per iteration.
  • This is a huge advantage when q << d (the number of principal components is much smaller than the original number of variables).

Cooperative Vector Quantizer Model


Cooperative Vector Quantizer (CVQ) Model

P(x, s | θ) = (2πσ²)^{−d/2} exp( −(1/(2σ²)) (x − Ws)^T (x − Ws) ) Π_{i=1}^{q} π_i^{s_i} (1 − π_i)^{1−s_i}

(Figure: two-layer graphical model with latent variables s_1, s_2, …, s_q and observed variables x_1, x_2, …, x_{d−1}, x_d.)

Latent variables s: q dimensions, q < d
  s ∈ {0,1}^q,  P(s_i | π_i) = π_i^{s_i} (1 − π_i)^{1−s_i}
Observed variables x: d dimensions
  x = Σ_{i=1}^{q} s_i w_i + ε, with Gaussian noise ε ~ N(0, σ²I)
  x | s ~ N( Σ_{i=1}^{q} s_i w_i, σ²I )  (Gaussian), where the w_i are the columns of W


Variational Bayesian Learning of CVQ

Objective: estimate the parameters of the model, W, π, σ².

• If both x and s were observable, the complete-data log-likelihood would be easy to compute:

  Σ_{n=1}^{N} log P(x_n, s_n | θ) = Σ_{n=1}^{N} [ −d log σ − (1/(2σ²)) ‖x_n − Ws_n‖² + Σ_{i=1}^{q} ( s_ni log π_i + (1 − s_ni) log(1 − π_i) ) ] + c

• With s unobservable, we need to determine which model structure best describes the data (the model with the highest posterior probability).
• P(x) = Σ_s P(x, s), where s has 2^q configurations.
• log P(X | θ) = Σ_{n=1}^{N} log P(x_n | θ) = Σ_{n=1}^{N} log Σ_{s_n} P(x_n, s_n | θ) = Σ_{n=1}^{N} log Σ_{s_n} P(s_n | x_n, θ) P(x_n | θ)
• The computation is exponential in q; for larger q this becomes a bottleneck.


Variational Approximation

Objective: find a lower bound on log P(x_n | θ).

log P(X | θ, ξ) = log P(X, S | θ, ξ) − log P(S | X, θ, ξ)

Averaging both sides over Q(S | λ):

E_{S|λ}[ log P(X | θ, ξ) ] = E_{S|λ}[ log P(S, X | θ, ξ) ] − E_{S|λ}[ log Q(S | λ) ]          … F(P, Q)
                           + E_{S|λ}[ log Q(S | λ) ] − E_{S|λ}[ log P(S | X, θ, ξ) ]          … KL(Q, P)

F(P, Q) = Σ_s Q(S | λ) log P(S, X | θ, ξ) − Σ_s Q(S | λ) log Q(S | λ)

KL(Q, P) = Σ_s Q(S | λ) [ log Q(S | λ) − log P(S | X, θ) ]

Note: if Q(S) is the exact posterior, then the variational EM reduces to the standard EM.


Variational EM

Kullback-Leibler (KL) divergence:
• A measure of the distance between two distributions
• KL(P ‖ R) = Σ_i P_i log(P_i / R_i) ≥ 0; it is always non-negative

KL(Q, P) = Σ_s Q(S | λ) [ log Q(S | λ) − log P(S | X, θ) ] = Σ_s Q(S | λ) log( Q(S | λ) / P(S | X, θ) ) ≥ 0

Therefore:  log P(X | θ, ξ) ≥ F(P, Q)
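A tiny numeric check of this decomposition, log P(x) = F(P, Q) + KL(Q, P), on a toy model with one observation and a single binary latent variable; all probabilities are made up.

```python
import numpy as np

# Toy joint P(x, s) for a single observation x and one binary latent s (made-up numbers)
P_joint = np.array([0.3, 0.1])            # P(x, s = 0), P(x, s = 1)
P_x = P_joint.sum()                       # P(x) = sum_s P(x, s)
P_post = P_joint / P_x                    # exact posterior P(s | x)

Q = np.array([0.6, 0.4])                  # an arbitrary variational distribution Q(s)
F = np.sum(Q * np.log(P_joint)) - np.sum(Q * np.log(Q))    # F(P, Q)
KL = np.sum(Q * (np.log(Q) - np.log(P_post)))              # KL(Q, P) >= 0

print(np.isclose(F + KL, np.log(P_x)))    # log P(x) = F(P, Q) + KL(Q, P), so F is a lower bound
```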


Mean Field Approximation

Objective: choose Q(S | λ).

Motivation:
• Exact calculation of the posterior is intractable.
• When one variable depends on many other variables, a change in any single one of them may have only a limited effect on that variable.

Haft et al. (1997) demonstrated that for any P(S) and any factorized distribution

  Q(S) = Π_{n=1}^{N} Q_n(s_n),

one can minimize KL(Q ‖ P) iteratively with respect to each Q_n(s_n) while keeping the others fixed:

  Q(S | λ) = Π_{n=1}^{N} Q_n(s_n | λ_n)

In the CVQ model all latent variables are binary, so

  Q_n(s_n | λ_n) = Π_{i=1}^{q} Q_n(s_ni | λ_ni) = Π_{i=1}^{q} λ_ni^{s_ni} (1 − λ_ni)^{1−s_ni}


Variational EM

F(P, Q) = −d log σ − (1/(2σ²)) [ x^T x − 2 Σ_{i=1}^{q} λ_i w_i^T x + Σ_{i=1}^{q} Σ_{j=1}^{q} ( λ_i λ_j + δ_ij (λ_i − λ_i²) ) w_i^T w_j ]
          + Σ_{i=1}^{q} [ λ_i log π_i + (1 − λ_i) log(1 − π_i) ] − Σ_{i=1}^{q} [ λ_i log λ_i + (1 − λ_i) log(1 − λ_i) ]

E-step:
Optimize F(P, Q) with respect to λ = (λ_1, λ_2, …, λ_N) while keeping (W, π, σ²) fixed.
Setting ∂F/∂λ_u = 0 gives a set of fixed-point equations, iterated for all indices u = 1, …, q and for all n:

  λ_u = g( (1/σ²) ( x − Σ_{j≠u} λ_j w_j )^T w_u − (1/(2σ²)) w_u^T w_u + log( π_u / (1 − π_u) ) )

where g(x) = 1 / (1 + e^{−x}) is the sigmoid function.
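A sketch of this mean-field E-step for a single data point x; the sweep count, initialization, and names are our own, while the update inside the loop is the fixed-point equation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cvq_e_step(x, W, pi, sigma2, n_sweeps=50):
    """Mean-field E-step for one data point x: iterate the fixed-point equations for lambda."""
    q = W.shape[1]
    lam = np.full(q, 0.5)                               # initialize the variational parameters
    for _ in range(n_sweeps):
        for u in range(q):
            residual = x - W @ lam + lam[u] * W[:, u]   # x - sum_{j != u} lambda_j w_j
            z = (residual @ W[:, u]) / sigma2 \
                - 0.5 * (W[:, u] @ W[:, u]) / sigma2 \
                + np.log(pi[u] / (1.0 - pi[u]))
            lam[u] = sigmoid(z)
    return lam
```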


Variational EM

F(P, Q) is the same expression as above.

M-step:
Optimize F(P, Q) with respect to (W, π, σ²) while keeping λ = (λ_1, λ_2, …, λ_N) fixed:

  ∂F/∂π_u = 0   →   π_u = ( Σ_{n=1}^{N} λ_nu ) / N

  ∂F/∂w_uv = Σ_{n=1}^{N} −(1/(2σ²)) [ −2 λ_nv x_nu + 2 Σ_{j≠v} λ_nv λ_nj w_uj + 2 λ_nv w_uv ] = 0   →   W

  ∂F/∂σ = −d/σ + (1/σ³)(⋯) = 0   →   σ²
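Of these updates, the π one is fully explicit; with the variational parameters stored as an N × q array lam (lam[n, u] = λ_nu), it is a one-liner. The array below is a made-up placeholder.

```python
import numpy as np

lam = np.random.default_rng(2).random((500, 3))  # placeholder lambda_nu values from an E-step
pi = lam.mean(axis=0)                            # pi_u = (1/N) * sum_n lambda_nu
# The W and sigma^2 updates come from solving their stationarity conditions above in the same way.
```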

Noisy-OR Component Analyzer


Noisy-OR

• A generalization of the logical OR.

Assumptions:
• All possible causes U_i for an event X are modeled as nodes (random variables) and their values, with T (or 1) reflecting the presence of the cause and F (or 0) its absence.
• If one needs to represent unknown causes, one can add a leak node.
• Parameters: for each cause U_i, define an (independent) probability q_i that the cause does not lead to X = T (or 1); in other words, q_i is the probability that the positive value of X is inhibited when U_i is present.

Note that the negated causes ¬U_i (reflecting the absence of the cause) do not have any influence on X.

  p(x = 1 | U_1, …, U_j, ¬U_{j+1}, …, ¬U_k) = 1 − Π_{i=1}^{j} q_i
  p(x = 0 | U_1, …, U_j, ¬U_{j+1}, …, ¬U_k) = Π_{i=1}^{j} q_i


Noisy-OR Example

πœ‡ |π‘₯ = 1 π‘ˆ1, … , π‘ˆj, Β¬π‘ˆπ‘—+1, … ,Β¬π‘ˆπ‘˜ = 1βˆ’ΰ·‘

𝑖=1

𝑗

π‘žπ‘–

πœ‡ |π‘₯ = 0 π‘ˆ1, … , π‘ˆj, Β¬π‘ˆπ‘—+1, … ,Β¬π‘ˆπ‘˜ =ΰ·‘

𝑖=1

𝑗

π‘žπ‘–

27

Cold Flu Malaria πœ‡(Fever) πœ‡ Β¬πΉπ‘’π‘£π‘’π‘Ÿ

F F F 0 1

F F T 0.9 0.1

F T F 0.8 0.2

F T T 0.98 0.02 = 0.2 Γ— 0.1

T F F 0.4 0.6

T F T 0.94 0.06 = 0.6 Γ— 0.1

T T F 0.88 0.12 = 0.6 Γ— 0.2

T T T 0.988 0.012 = 0.6 Γ— 0.2 Γ— 0.1

FlueColdMala-ria

Fever

q_mal=0.1q_fl=0.2

q_cold=0.6
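A short sketch that reproduces the μ(Fever) column of this table from the three inhibition probabilities:

```python
from itertools import product

# Inhibition probabilities from the example above
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

def p_fever(present):
    """P(Fever = T | causes in `present` are T, all others F) = 1 - product of q_i over present causes."""
    p_no_fever = 1.0
    for cause in present:
        p_no_fever *= q[cause]
    return 1.0 - p_no_fever

for combo in product("FT", repeat=3):
    present = [c for c, v in zip(["Cold", "Flu", "Malaria"], combo) if v == "T"]
    print(combo, round(p_fever(present), 3))      # matches the mu(Fever) column row by row
```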


Noisy-OR parameter reduction

(Same network as in the example above: parents Cold, Flu, Malaria with q_cold = 0.6, q_flu = 0.2, q_mal = 0.1, and child Fever.)

• In general, the number of entries defining the CPT (conditional probability table) grows exponentially with the number of parents: for q binary parents the number is 2^q.
• For the noisy-OR CPT, the number of parameters is q + 1 (q inhibition probabilities plus a leak probability).

Example:
• Full CPT: 8 different combinations of values for the 3 binary parents.
• Noisy-OR: 3 parameters (no leak node in this example).


Noisy-OR Component Analyzer (NOCA)

(Figure: two-layer graphical model with latent variables s_1, s_2, …, s_q and observed variables x_1, x_2, …, x_{d−1}, x_d.)

Observed variables x: d dimensions, x ∈ {0,1}^d

  P(x) = Σ_s Π_{j=1}^{d} P(x_j | s) Π_{i=1}^{q} P(s_i)

Latent variables s: (q + 1) dimensions (q sources plus a leak), q < d
  s ∈ {0,1}^q,  P(s_i | π_i) = π_i^{s_i} (1 − π_i)^{1−s_i}

Loading matrix: p = {p_ij}, i = 1, …, q, j = 1, …, d


Variational EM for NOCA

• Similar to what we did for CVQ, we simplify the distribution with a decomposable Q(s):

  log P(x | θ) = log Π_{n=1}^{N} P(x_n | θ) = Σ_{n=1}^{N} log Σ_{s_n} P(x_n, s_n | θ)
              = Σ_{n=1}^{N} log Σ_{s_n} Q(s_n) · [ P(x_n, s_n | θ, q_n) / Q(s_n) ]
              ≥ Σ_{n=1}^{N} ( E_{s_n}[ log P(x_n, s_n | θ) ] − E_{s_n}[ log Q(s_n) ] )

• log P(x_n, s_n | θ, q_n) still cannot be handled easily.
• The noisy-OR distribution is not in the exponential family.


Variational EM for NOCA

A further lower bound is required.
• Jensen's inequality (for a concave f and weights q_j(i) that sum to 1): f( a + Σ_i q_j(i) x_i ) ≥ Σ_i q_j(i) f( a + x_i )

P(x_j | s) = [ 1 − (1 − p_0j) Π_{i=1}^{q} (1 − p_ij)^{s_i} ]^{x_j} [ (1 − p_0j) Π_{i=1}^{q} (1 − p_ij)^{s_i} ]^{1−x_j}

Let θ_ij = −log(1 − p_ij). Then

P(x_j | s) = exp( x_j log( 1 − exp(−θ_0j − Σ_{i=1}^{q} θ_ij s_i) ) + (1 − x_j) ( −θ_0j − Σ_{i=1}^{q} θ_ij s_i ) )

P(x_j | s) does not factorize over the s_i for x_j = 1:

P(x_j = 1 | s) = exp[ log( 1 − exp(−θ_0j − Σ_{i=1}^{q} θ_ij s_i) ) ]
             = exp[ log( 1 − exp(−θ_0j − Σ_{i=1}^{q} q_j(i) · (θ_ij s_i / q_j(i))) ) ]
             ≥ exp[ Σ_{i=1}^{q} q_j(i) log( 1 − exp(−θ_0j − θ_ij s_i / q_j(i)) ) ]
             = exp[ Σ_{i=1}^{q} q_j(i) ( s_i log( 1 − exp(−θ_0j − θ_ij / q_j(i)) ) + (1 − s_i) log( 1 − exp(−θ_0j) ) ) ]
             = Π_{i=1}^{q} exp[ q_j(i) s_i ( log( 1 − exp(−θ_0j − θ_ij / q_j(i)) ) − log( 1 − exp(−θ_0j) ) ) + q_j(i) log( 1 − exp(−θ_0j) ) ]
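A small numeric check of the final inequality, using made-up values of θ_0j and θ_ij, a random binary s, and random weights q_j(i) that sum to one; the resulting bound never exceeds the exact P(x_j = 1 | s).

```python
import numpy as np

rng = np.random.default_rng(4)
q = 4
theta0 = 0.3                                  # theta_0j (leak term), made-up value
theta = rng.uniform(0.1, 2.0, size=q)         # theta_ij = -log(1 - p_ij), made-up values
s = rng.integers(0, 2, size=q)                # a binary source configuration
qj = rng.random(q); qj /= qj.sum()            # variational weights q_j(i), summing to 1

exact = 1.0 - np.exp(-theta0 - theta @ s)     # P(x_j = 1 | s)
bound = np.exp(np.sum(qj * (s * np.log(1 - np.exp(-theta0 - theta / qj))
                            + (1 - s) * np.log(1 - np.exp(-theta0)))))
print(bound <= exact + 1e-12, exact, bound)   # the Jensen lower bound never exceeds the exact value
```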


Variational EM for NOCA

A further lower bound is required:

  log P(x | θ) ≥ Σ_{n=1}^{N} ( E_{s_n}[ log P(x_n, s_n | θ) ] − E_{s_n}[ log Q(s_n) ] )
             ≥ Σ_{n=1}^{N} ( E_{s_n}[ log P̃(x_n, s_n | θ, q_n) ] − E_{s_n}[ log Q(s_n) ] )
             = Σ_{n=1}^{N} ( E_{s_n}[ log P̃(x_n | s_n, θ, q_n) P(s_n | θ) ] − E_{s_n}[ log Q(s_n) ] )
             = Σ_{n=1}^{N} F_n(x_n, Q(s_n)) = F(x, Q(s))

where P̃ denotes the Jensen lower bound on P derived above.


Variational EM for NOCA

Parameters: q_n, θ_ij, θ_0j

• E-step: update q_n to optimize F_n via fixed-point iterations of the form

  q_nj(i) ⇐ ⟨s_ni⟩_{Q(S_n)} q_nj(i) [ log( 1 − e^{−θ_0j} ) − log( 1 − A_n(i, j) ) − (θ_ij / q_nj(i)) · A_n(i, j) / (1 − A_n(i, j)) ] − …


Summary

Comparison of pPCA, CVQ, and NOCA:

• Latent variables s (q-dimensional):
  pPCA: s ~ N(0, I);  CVQ: s ∈ {0,1}^q;  NOCA: s ∈ {0,1}^q
• Observed variables x (d-dimensional):
  pPCA: x = Ws + μ + ε ~ N(μ, C), μ non-zero mean, ε ~ N(0, σ²I), C = WW^T + σ²I;  CVQ: x = Σ_{i=1}^{q} s_i w_i + ε, ε ~ N(0, σ²I);  NOCA: x ∈ {0,1}^d
• Dependency:
  pPCA: x | s ~ N(Ws + μ, σ²I);  CVQ: x | s ~ N(Σ_{i=1}^{q} s_i w_i, σ²I), w_i Gaussian;  NOCA: loading matrix p = {p_ij}, i = 1, …, q, j = 1, …, d
• Parameter estimation method:
  pPCA: ML, EM;  CVQ: variational EM;  NOCA: variational EM

• The dimension reduction mapping can also be interpreted as a variational autoencoder.

References

• CS2750 Spring 2018, Lecture 17 slides.
• Week 2 slides of the Coursera course "Bayesian Methods for Machine Learning".
• http://www.blutner.de/Intension/Noisy%20OR.pdf
• C. Bishop. Pattern Recognition and Machine Learning, Section 12.2.
• M. E. Tipping and C. M. Bishop. Probabilistic Principal Component Analysis, 1999 (technical report, 1997).
• S. Roweis. EM Algorithms for PCA and SPCA. NIPS, 1998.
• T. Šingliar and M. Hauskrecht. Noisy-OR Component Analysis and its Application to Link Analysis, 2006.
• T. S. Jaakkola and M. I. Jordan. Variational Probabilistic Inference and the QMR-DT Network. Journal of Artificial Intelligence Research 10 (1999): 291-322.