CS 3750 Advanced Machine Learning
Lecture 8
Latent Variable Models
Tianyi Cui
Sept 20, 2018
Outline
• Latent Variable Models
• Probabilistic Principal Component Analysis
  • Model Formulation
  • Maximum Likelihood Estimation
  • Expectation Maximization Algorithm for pPCA
• Cooperative Vector Quantizer Model
  • Model Formulation
  • Variational Bayesian Learning for CVQ
• Noisy-OR Component Analyzer
  • Model Formulation
  • Variational EM for NOCA
• References
Latent Variable Models
Name   | High School Grade | University Grade | IQ score | Phone Interview | Onsite Interview
John   | 4.0               | 4.0              | 120      | 3/4             | ?
Helen  | 3.2               | N/A              | 112      | 2/4             | ?
Sophia | 3.5               | 3.6              | N/A      | 4/4             | 85/100
Jack   | 3.6               | N/A              | N/A      | 3/4             | ?
[Figure: graphical model over the observed variables IQ score, GPA, High School Grade, Phone Interview, and Onsite Interview]
Latent Variable Models
[Figure: the same example data, now modeled with a latent variable "Intelligence" as a common parent of IQ score, GPA, High School Grade, Phone Interview, and Onsite Interview]
Latent Variable Models
Pros:
• Dimension reduction
• Explain the correlations among observed data at the level of the latent variables
• Allow inferring the state of the latent variables

Cons:
• Hard to work with (exact inference and learning are often intractable)
Observed variables x: d-dimensional; latent variables s: q-dimensional, with q < d

[Figure: latent variables s_1, …, s_q connected to observed variables x_1, …, x_d]
Probabilistic Principal Component Analysis
Review: Principal Component Analysis
• Projects the data onto a low-dimensional space
• Maximizes the variance of the projected data

Limitations:
• Non-parametric
• Computationally intensive: requires the sample covariance matrix
• Cannot handle missing data
Probabilistic Principal Component Analysis
• Generative model for PCA
• Takes advantage of probability distributions in the computations
• Observed variables become independent given the latent variables
• Standard PCA is recovered in the limit σ² → 0

Observed variables x: d-dimensional; latent variables s: q-dimensional (q < d)
  s ~ N(0, I)
  x = Ws + μ + ε ~ N(μ, C_x), with μ a non-zero mean, ε ~ N(0, σ²I), and C_x = WWᵀ + σ²I
  x | s ~ N(Ws + μ, σ²I)

[Figure: latent variables s_1, …, s_q connected to observed variables x_1, …, x_d]
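As a quick illustration of this generative process, here is a minimal NumPy sketch; the dimensions, parameter values, and variable names are illustrative assumptions, not taken from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 5, 2, 10000            # observed dim, latent dim, sample count (assumed)
W = rng.normal(size=(d, q))      # loading matrix W
mu = rng.normal(size=d)          # non-zero mean
sigma2 = 0.1                     # isotropic noise variance

S = rng.normal(size=(n, q))                               # s ~ N(0, I)
eps = rng.normal(scale=np.sqrt(sigma2), size=(n, d))      # eps ~ N(0, sigma^2 I)
X = S @ W.T + mu + eps                                    # x = W s + mu + eps

# The marginal covariance of x should approach C_x = W W^T + sigma^2 I
C_x = W @ W.T + sigma2 * np.eye(d)
print(np.max(np.abs(np.cov(X, rowvar=False) - C_x)))      # small for large n
```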
Factor analysis
• Latent variable model with a linear relationship: x = Ws + μ + ε
  • W is a d × q matrix that relates the observed variables x to the latent variables s
  • Latent variables: s ~ N(0, I)
  • Error (or noise): ε ~ N(0, Ψ), Gaussian noise with diagonal covariance Ψ
  • Location term (mean): μ
• Then x ~ N(μ, C_x), where C_x = WWᵀ + Ψ is the covariance matrix of the observed variables x
• The model's parameters W, μ and Ψ can be found by maximum likelihood estimation
Probabilistic PCA (PPCA)
• A special case of the factor analysis model
  • Noise variances constrained to be equal (ψ_i = σ²)
• x = Ws + μ + ε
  • Latent variables: s ~ N(0, I)
  • Error (or noise): ε ~ N(0, σ²I) (isotropic noise model)
  • Location term (mean): μ
• x | s ~ N(Ws + μ, σ²I)
• x ~ N(μ, C_x), where C_x = WWᵀ + σ²I is the covariance matrix of x
• Standard PCA is a limiting case of probabilistic PCA, obtained as the noise covariance becomes infinitesimally small (Ψ = lim_{σ²→0} σ²I)
PPCA (Maximum likelihood PCA)
• Log-likelihood for the Gaussian noise model:

  L = −(n/2) [ d ln(2π) + ln|C_x| + tr(C_x⁻¹ S) ],  with C_x = WWᵀ + σ²I

• Maximum likelihood estimates for the above:
  • μ: mean of the data
  • S (sample covariance matrix of the observations X):

    S = (1/n) Σ_{i=1}^{n} (x_i − μ)(x_i − μ)ᵀ

• MLEs for W and σ² can be solved in two ways:
  • closed form (Tipping and Bishop)
  • EM algorithm (Roweis)

(Tr(A) = sum of the diagonal elements of A)
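A hedged sketch of how this log-likelihood could be evaluated numerically; the function name and code organization are mine, not from the slides:

```python
import numpy as np

def ppca_log_likelihood(X, W, mu, sigma2):
    """L = -(n/2) [ d ln(2*pi) + ln|C_x| + tr(C_x^{-1} S) ], with C_x = W W^T + sigma^2 I."""
    n, d = X.shape
    C = W @ W.T + sigma2 * np.eye(d)       # model covariance C_x
    Xc = X - mu
    S = (Xc.T @ Xc) / n                    # sample covariance of the observations
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet
                       + np.trace(np.linalg.solve(C, S)))
```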
Probabilistic PCA
The likelihood is maximized when:

  W_ML = U_q (Λ_q − σ²I)^(1/2) R

• For W = W_ML, U_q is a d × q matrix whose q column vectors are the principal eigenvectors of S.
• Λ_q is a q × q diagonal matrix with the corresponding eigenvalues λ_i along the diagonal.
• R is an arbitrary q × q orthogonal rotation matrix.
• The maximum likelihood estimate for σ² is:

  σ²_ML = (1/(d − q)) Σ_{j=q+1}^{d} λ_j

• To find the most likely model, estimate σ²_ML and then W_ML with R = I, or employ the EM algorithm.
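These closed-form estimates translate directly into code; the following is a sketch under the convention R = I (the function name and data layout are my own assumptions):

```python
import numpy as np

def ppca_closed_form(X, q):
    """Closed-form ML estimates of (W, mu, sigma^2) following the formulas above, with R = I."""
    n, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    S = (Xc.T @ Xc) / n                      # sample covariance
    lam, U = np.linalg.eigh(S)               # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]           # sort descending
    sigma2 = lam[q:].mean()                  # average of the discarded eigenvalues
    W = U[:, :q] @ np.diag(np.sqrt(lam[:q] - sigma2))
    return W, mu, sigma2
```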
Review: General Expectation Maximization (EM)
The key idea of the method: compute the parameter estimates iteratively by performing the following two steps.

Two steps of the EM:
1. Expectation step. For all hidden and missing variables (and their possible value assignments), calculate their expectations for the current set of parameters Θ′.
2. Maximization step. Compute the new estimates of Θ by considering the expectations of the different value completions.

Stop when no further improvement is possible.
Borrowed from CS2750 Lecture Slides
EM for Probabilistic PCA
Parameters: μ (mean), W (weight matrix), σ² (noise variance)
• ML PCA is computationally heavy for high-dimensional data or large datasets
p(s_n, x_n) = (2πσ²)^(−d/2) exp( −‖x_n − W s_n − μ‖² / (2σ²) ) · (2π)^(−q/2) exp( −‖s_n‖² / 2 )

E[ log p(X, S | μ, W, σ²) ] = −Σ_{n=1}^{N} { (d/2) log(2πσ²) + (1/2) Tr( E[s_n s_nᵀ] ) + (1/(2σ²)) ‖x_n − μ‖²
                              − (1/σ²) E[s_n]ᵀ Wᵀ (x_n − μ) + (1/(2σ²)) Tr( E[s_n s_nᵀ] Wᵀ W ) }
EM for Probabilistic PCA
Parameters: μ (mean), W (weight matrix), σ² (noise variance)
• E-step: derive E[s_n] and E[s_n s_nᵀ] from Q(s_n) = p(s_n | x_n, θ) using the current parameters:

  Q(s_n) = p(s_n | x_n, θ) = p(x_n | s_n, θ) p(s_n | θ) / p(x_n | θ), a Gaussian with moments E[s_n], E[s_n s_nᵀ]

  E[s_n] = M⁻¹ Wᵀ (x_n − μ)
  E[s_n s_nᵀ] = σ² M⁻¹ + E[s_n] E[s_n]ᵀ
  M = Wᵀ W + σ² I

• M-step: maximize the expected complete-data log-likelihood with respect to W and σ²:

  W̃ = [ Σ_{n=1}^{N} (x_n − μ) E[s_n]ᵀ ] [ Σ_{n=1}^{N} E[s_n s_nᵀ] ]⁻¹

  σ̃² = (1/(N d)) Σ_{n=1}^{N} { ‖x_n − μ‖² − 2 E[s_n]ᵀ W̃ᵀ (x_n − μ) + Tr( E[s_n s_nᵀ] W̃ᵀ W̃ ) }
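A compact sketch of one such EM iteration, assuming μ has already been set to the data mean; the vectorized layout and variable names are mine:

```python
import numpy as np

def ppca_em_step(X, mu, W, sigma2):
    """One EM update for pPCA: E-step moments of s_n, then M-step for W and sigma^2."""
    n, d = X.shape
    q = W.shape[1]
    Xc = X - mu
    M = W.T @ W + sigma2 * np.eye(q)
    Minv = np.linalg.inv(M)
    Es = Xc @ W @ Minv                       # rows are E[s_n] = M^{-1} W^T (x_n - mu)
    Ess_sum = n * sigma2 * Minv + Es.T @ Es  # sum_n E[s_n s_n^T]
    W_new = (Xc.T @ Es) @ np.linalg.inv(Ess_sum)
    sigma2_new = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Es * (Xc @ W_new))
                  + np.trace(Ess_sum @ W_new.T @ W_new)) / (n * d)
    return W_new, sigma2_new
```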
Advantages of using EM algorithm in probabilistic PCA models
• Convergence:
  • Tipping and Bishop showed (1997) that the only stable local extremum is the global maximum, at which the true principal subspace is found
• Complexity:
  • Methods that explicitly compute the sample covariance matrix have complexity O(nd²)
  • The EM algorithm does not require computing the sample covariance matrix: O(dnq) per iteration
  • Huge advantage when q << d (the number of principal components is much smaller than the original number of variables)
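As a rough illustration (my numbers, not from the slides): with d = 1000 variables, n = 10,000 samples, and q = 10 components, forming the sample covariance costs on the order of nd² ≈ 10¹⁰ operations, while a single EM sweep is on the order of dnq ≈ 10⁸, roughly a hundred-fold saving per iteration.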
Cooperative Vector Quantizer Model
Cooperative Vector Quantizer (CVQ) Model
p(x, s | θ) = (2π)^(−d/2) σ^(−d) exp( −‖x − Σ_{j=1}^{q} s_j w_j‖² / (2σ²) ) Π_{j=1}^{q} π_j^{s_j} (1 − π_j)^{1−s_j}

Observed variables x: d-dimensional
  x = Σ_{j=1}^{q} s_j w_j + ε, with Gaussian noise ε ~ N(0, σ²I)
  x | s ~ N( Σ_{j=1}^{q} s_j w_j , σ²I ), w_j: Gaussian
Latent variables s: q-dimensional (q < d)
  s ∈ {0,1}^q, p(s_j | π_j) = π_j^{s_j} (1 − π_j)^{1−s_j}

[Figure: binary latent variables s_1, …, s_q connected to observed variables x_1, …, x_d]
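A minimal sketch of sampling from this generative model; the sizes and parameter values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 8, 3, 500                          # assumed sizes
pi = rng.uniform(0.2, 0.8, size=q)           # Bernoulli priors pi_j
W = rng.normal(size=(q, d))                  # rows are the Gaussian vectors w_j
sigma2 = 0.05

S = (rng.uniform(size=(n, q)) < pi).astype(float)            # s_j ~ Bernoulli(pi_j)
X = S @ W + rng.normal(scale=np.sqrt(sigma2), size=(n, d))   # x = sum_j s_j w_j + eps
```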
Variational Bayesian Learning of CVQ
• With s unobservable, we need to determine which model structure best describes the data (the model with the highest posterior probability)
• p(x) = Σ_s p(x, s), and s has 2^q configurations
• log p(D | θ) = Σ_{n=1}^{N} log p(x_n | θ) = Σ_{n=1}^{N} log Σ_s p(x_n, s | θ) = Σ_{n=1}^{N} log Σ_s p(x_n | s, θ) p(s | θ)
• The computation is exponential in q; for larger q this becomes a bottleneck

Objective: estimate the parameters of the model: W, π, σ²

If both x and s were observable, it would be easy to calculate

Σ_{n=1}^{N} log p(x_n, s_n | θ) = Σ_{n=1}^{N} [ −d log σ − (1/(2σ²)) ‖x_n − Σ_{j=1}^{q} s_{nj} w_j‖² ]
                                 + Σ_{n=1}^{N} Σ_{j=1}^{q} [ s_{nj} log π_j + (1 − s_{nj}) log(1 − π_j) ] + c
Variational Approximation
Objective: find a lower bound for log P(D | θ)

log p(x | θ) = log p(x, s | θ) − log p(s | x, θ)

Average both sides with Q(s):

E_Q(s)[ log p(x | θ) ] = E_Q(s)[ log p(x, s | θ) ] − E_Q(s)[ log Q(s) ]   ……  F(Q, θ)
                       + E_Q(s)[ log Q(s) ] − E_Q(s)[ log p(s | x, θ) ]  ……  KL(Q, P)

F(Q, θ) = Σ_s Q(s) log p(x, s | θ) − Σ_s Q(s) log Q(s)
KL(Q, P) = Σ_s Q(s) [ log Q(s) − log p(s | x, θ) ]

Note: if Q(s) is the exact posterior, the variational EM reduces to the standard EM
Variational EM
Kullback-Leibler (KL) divergence:
• Measures the difference between two distributions
• KL(q ‖ p) = Σ q log(q / p) ≥ 0, always non-negative

KL(Q, P) = Σ_s Q(s) [ log Q(s) − log p(s | x, θ) ] = Σ_s Q(s) log [ Q(s) / p(s | x, θ) ] ≥ 0

Therefore log p(x | θ) ≥ F(Q, θ)
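The decomposition log p(x | θ) = F(Q, θ) + KL(Q, P) can be checked numerically on a toy discrete model; here is a small sketch with made-up numbers:

```python
import numpy as np

# Toy model: one binary latent s and one fixed observation x (values are made up).
p_joint = np.array([0.3, 0.1])        # p(x, s=0), p(x, s=1) for the observed x
p_x = p_joint.sum()                   # p(x)
p_post = p_joint / p_x                # p(s | x)

Q = np.array([0.7, 0.3])              # an arbitrary distribution Q(s)
F = np.sum(Q * np.log(p_joint)) - np.sum(Q * np.log(Q))    # free energy F(Q, theta)
KL = np.sum(Q * (np.log(Q) - np.log(p_post)))              # KL(Q || p(s|x)) >= 0
print(np.isclose(np.log(p_x), F + KL), F <= np.log(p_x))   # True True
```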
Mean Field Approximation
Objective: choose Q(s | ν)

Motivation:
• Exact calculation of the posterior is intractable
• When one variable depends on many other variables, a change in each of them may have only a limited effect on that variable

Haft et al. (1997) demonstrated that for any P(s) and Q(s), if the distribution factorizes as

  Q(s) = Π_{n=1}^{N} Q_n(s_n)

one can minimize KL(Q ‖ P) iteratively with respect to each Q_n(s_n) while keeping the others fixed.

  Q(s | ν) = Π_{n=1}^{N} Q_n(s_n | ν_n)

In the CVQ model all latent variables are binary, so

  Q_n(s_n | ν_n) = Π_{j=1}^{q} Q_j(s_nj | ν_nj) = Π_{j=1}^{q} ν_nj^{s_nj} (1 − ν_nj)^{1−s_nj}
Variational EM
F(Q, θ) = −d log σ − (1/(2σ²)) [ xᵀx − 2 Σ_{j=1}^{q} q_j w_jᵀ x + Σ_{i=1}^{q} Σ_{j=1}^{q} ( q_i q_j + δ_ij (q_j − q_j²) ) w_iᵀ w_j ]
          + Σ_{j=1}^{q} [ q_j log π_j + (1 − q_j) log(1 − π_j) ] − Σ_{j=1}^{q} [ q_j log q_j + (1 − q_j) log(1 − q_j) ]

E-step:
Optimize F(Q, θ) with respect to q_1, q_2, …, q_q while keeping (W, π, σ²) fixed.
Setting ∂F/∂q_u = 0 gives a set of fixed-point equations that are iterated for all indexes u = 1…q and for all data points:

  q_u = g( (1/σ²) ( x − Σ_{j≠u} q_j w_j )ᵀ w_u − (1/(2σ²)) w_uᵀ w_u + log( π_u / (1 − π_u) ) )

where g(x) = 1 / (1 + e^(−x)) is the sigmoid function.
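A sketch of this fixed-point E-step for a single observation x; the iteration count, initialization, and function name are my own assumptions:

```python
import numpy as np

def cvq_e_step(x, W, pi, sigma2, n_iter=50):
    """Mean-field fixed-point updates q_u = g(...) for one observation x.

    W has rows w_j; pi are the Bernoulli priors; returns q_j = Q(s_j = 1)."""
    k = W.shape[0]
    q = np.full(k, 0.5)                              # initial responsibilities
    for _ in range(n_iter):
        for u in range(k):
            resid = x - (q @ W - q[u] * W[u])        # x - sum_{j != u} q_j w_j
            z = (resid @ W[u]) / sigma2 \
                - (W[u] @ W[u]) / (2 * sigma2) \
                + np.log(pi[u] / (1 - pi[u]))
            q[u] = 1.0 / (1.0 + np.exp(-z))          # g(z), the sigmoid
    return q
```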
Variational EM
F(Q, θ) as on the previous slide.

M-step:
Optimize F(Q, θ) with respect to (W, π, σ²) while keeping q_1, q_2, …, q_q fixed.

  ∂F/∂π_u = 0  ⇒  π_u = ( Σ_{n=1}^{N} q_{nu} ) / N

  ∂F/∂w_uv = 0 for all u, v yields a set of linear equations; equivalently,
  Σ_{n=1}^{N} q_{nu} x_n = Σ_{j=1}^{q} [ Σ_{n=1}^{N} ( q_{nu} q_{nj} + δ_uj (q_{nu} − q_{nu}²) ) ] w_j, which is solved for the vectors w_j

  ∂F/∂σ = 0  ⇒  σ² = (1/(N d)) Σ_{n=1}^{N} E_Q[ ‖x_n − Σ_{j=1}^{q} s_{nj} w_j‖² ]
Noisy-OR Component Analyzer
Noisy-OR
Assumptions:
• All possible causes c_i of an event x are modeled using nodes (random variables) and their values, with T (or 1) reflecting the presence of the cause and F (or 0) its absence
• If one needs to represent unknown causes, one can add a leak node
• Parameters: for each cause c_i define an (independent) probability q_i that represents the probability with which the cause does not lead to x = T (or 1); in other words, it represents the probability that the positive value of variable x is inhibited when c_i is present
• Note that the negated causes ¬c_i (reflecting the absence of the cause) do not have any influence on x

• A generalization of the logical OR. If causes c_1, …, c_j are present and c_{j+1}, …, c_q are absent:

  p(x = 1 | c_1, …, c_j, ¬c_{j+1}, …, ¬c_q) = 1 − Π_{i=1}^{j} q_i
  p(x = 0 | c_1, …, c_j, ¬c_{j+1}, …, ¬c_q) = Π_{i=1}^{j} q_i
Noisy-OR Example
p(x = 1 | c_1, …, c_j, ¬c_{j+1}, …, ¬c_q) = 1 − Π_{i=1}^{j} q_i
p(x = 0 | c_1, …, c_j, ¬c_{j+1}, …, ¬c_q) = Π_{i=1}^{j} q_i

Cold | Flu | Malaria | P(Fever) | P(¬Fever)
F    | F   | F       | 0        | 1
F    | F   | T       | 0.9      | 0.1
F    | T   | F       | 0.8      | 0.2
F    | T   | T       | 0.98     | 0.02 = 0.2 × 0.1
T    | F   | F       | 0.4      | 0.6
T    | F   | T       | 0.94     | 0.06 = 0.6 × 0.1
T    | T   | F       | 0.88     | 0.12 = 0.6 × 0.2
T    | T   | T       | 0.988    | 0.012 = 0.6 × 0.2 × 0.1

[Figure: noisy-OR network with causes Cold, Flu, Malaria and effect Fever; q_cold = 0.6, q_flu = 0.2, q_mal = 0.1]
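The noisy-OR CPT is easy to compute programmatically; this sketch (the function name is mine) reproduces the table above:

```python
import numpy as np

def noisy_or_p_x1(present, q):
    """P(x = 1 | causes): one minus the product of the inhibition
    probabilities q_i of the causes that are present."""
    q = np.asarray(q, dtype=float)
    return 1.0 - np.prod(q[np.asarray(present, dtype=bool)])

q = [0.6, 0.2, 0.1]                               # q_cold, q_flu, q_mal
print(noisy_or_p_x1([False, True, False], q))     # Flu only: 1 - 0.2 = 0.8
print(noisy_or_p_x1([True, True, True], q))       # all present: 1 - 0.6*0.2*0.1 = 0.988
```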
Noisy-OR parameter reduction
[Figure: the same noisy-OR network — causes Cold, Flu, Malaria, effect Fever; q_cold = 0.6, q_flu = 0.2, q_mal = 0.1]

• Note that in general the number of entries defining the CPT (conditional probability table) grows exponentially with the number of parents: for q binary parents the number is 2^q
• For the noisy-OR CPT the number of parameters is q + 1

Example: a CPT needs 8 different combinations of values for 3 binary parents; the noisy-OR representation needs only 3 parameters.
Noisy-OR Component Analyzer (NOCA)
Observed variables x: d-dimensional, x ∈ {0,1}^d

  p(x) = Σ_s [ Π_{i=1}^{d} p(x_i | s) ] [ Π_{j=1}^{q} p(s_j) ]

Latent variables s: (q + 1)-dimensional
  s ∈ {0,1}^q, p(s_j | π_j) = π_j^{s_j} (1 − π_j)^{1−s_j}

Loading matrix: p = {p_ij}, i = 1, …, d, j = 1, …, q;  q < d

[Figure: binary latent variables s_1, …, s_q connected to binary observed variables x_1, …, x_d]
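A small sketch of sampling from the NOCA generative model, treating the extra latent dimension as an always-on leak node s_0 = 1 with parameters p_i0; the sizes and values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 10, 4, 200                           # assumed sizes
pi = rng.uniform(0.1, 0.5, size=q)             # priors on the binary sources s_j
P = rng.uniform(0.0, 0.9, size=(d, q + 1))     # loading matrix; column 0 holds the leak p_i0

S = (rng.uniform(size=(n, q)) < pi).astype(int)
# p(x_i = 1 | s) = 1 - (1 - p_i0) * prod_j (1 - p_ij)^{s_j}
p_off = (1 - P[:, 0]) * np.prod((1 - P[:, 1:]) ** S[:, None, :], axis=2)
X = (rng.uniform(size=(n, d)) < 1 - p_off).astype(int)
```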
Variational EM for NOCA
• Similar to what we did for CVQ, we simplify the distribution with a decomposable Q(s)

log p(x | θ) = log Π_{n=1}^{N} p(x_n | θ) = Σ_{n=1}^{N} log Σ_s p(x_n, s | θ)
             = Σ_{n=1}^{N} log Σ_s p(x_n, s | θ) Q_n(s) / Q_n(s)
             ≥ Σ_{n=1}^{N} { E_{Q_n(s)}[ log p(x_n, s | θ) ] − E_{Q_n(s)}[ log Q_n(s) ] }

• log p(x_n, s | θ) still cannot be handled easily
• Noisy-OR is not in the exponential family
Variational EM for NOCA
A further lower bound is required
• Jensen's inequality: f(x + Σ_j a_j q_j) ≥ Σ_j q_j f(x + a_j), for concave f and weights q_j that sum to one

p(x_i | s) = [ 1 − (1 − p_i0) Π_{j=1}^{q} (1 − p_ij)^{s_j} ]^{x_i} [ (1 − p_i0) Π_{j=1}^{q} (1 − p_ij)^{s_j} ]^{1−x_i}

Let θ_ij = −log(1 − p_ij). Then

p(x_i | s) = exp{ x_i log[ 1 − exp( −θ_i0 − Σ_{j=1}^{q} θ_ij s_j ) ] + (1 − x_i) ( −θ_i0 − Σ_{j=1}^{q} θ_ij s_j ) }

P(x_i | s) does not factorize over s for x_i = 1:

p(x_i = 1 | s) = exp[ log( 1 − exp( −θ_i0 − Σ_{j=1}^{q} θ_ij s_j ) ) ]
             = exp[ log( 1 − exp( −θ_i0 − Σ_{j=1}^{q} θ_ij s_j (q_i(j) / q_i(j)) ) ) ]
             ≥ exp[ Σ_{j=1}^{q} q_i(j) log( 1 − exp( −θ_i0 − θ_ij s_j / q_i(j) ) ) ]
             = exp[ Σ_{j=1}^{q} q_i(j) [ s_j log( 1 − exp( −θ_i0 − θ_ij / q_i(j) ) ) + (1 − s_j) log( 1 − exp( −θ_i0 ) ) ] ]
             = Π_{j=1}^{q} exp[ q_i(j) s_j [ log( 1 − exp( −θ_i0 − θ_ij / q_i(j) ) ) − log( 1 − exp( −θ_i0 ) ) ] + q_i(j) log( 1 − exp( −θ_i0 ) ) ]

where q_i(·) is a variational distribution over the latent sources for observation x_i.
Variational EM for NOCA
A further lower bound is required:

log p(x | θ) ≥ Σ_{n=1}^{N} { E_{Q_n(s)}[ log p(x_n, s | θ) ] − E_{Q_n(s)}[ log Q_n(s) ] }
             ≥ Σ_{n=1}^{N} { E_{Q_n(s)}[ log p̃(x_n, s | θ, q_n) ] − E_{Q_n(s)}[ log Q_n(s) ] }
             = Σ_{n=1}^{N} { E_{Q_n(s)}[ log p̃(x_n | s, θ, q_n) p(s | θ) ] − E_{Q_n(s)}[ log Q_n(s) ] }
             = Σ_{n=1}^{N} F_n(x_n, Q_n(s))
             = F(x, Q(s))

where p̃ is the Jensen lower bound from the previous slide.
Variational EM for NOCA
Parameters: π_j, θ_i0, θ_ij
• E-step: update the variational parameters (the mean-field responsibilities Q_n and the Jensen weights q_i(j)) to optimize F_n, by iterating coupled fixed-point equations.
Summary
Latent variables (s, q-dimensional):
  pPCA: s ~ N(0, I);  CVQ: s ∈ {0,1}^q;  NOCA: s ∈ {0,1}^q

Observed variables (x, d-dimensional):
  pPCA: x = Ws + μ + ε ~ N(μ, C), with μ a non-zero mean, ε ~ N(0, σ²I), C = WWᵀ + σ²I
  CVQ: x = Σ_{j=1}^{q} s_j w_j + ε, Gaussian noise ε ~ N(0, σ²I)
  NOCA: x ∈ {0,1}^d

Dependency:
  pPCA: x | s ~ N(Ws + μ, σ²I)
  CVQ: x | s ~ N( Σ_{j=1}^{q} s_j w_j , σ²I ), w_j: Gaussian
  NOCA: loading matrix p = {p_ij}, i = 1, …, d, j = 1, …, q

Parameter estimation method:
  pPCA: ML, EM;  CVQ: variational EM;  NOCA: variational EM

• The dimension reduction mapping can also be interpreted as a variational autoencoder.
References
• CS2750 Spring 2018 Lecture 17 Slides
• Week 2 slides from the Coursera course "Bayesian Methods for Machine Learning"
• http://www.blutner.de/Intension/Noisy%20OR.pdf
• C. Bishop. Pattern Recognition and Machine Learning. Chapter 12.2.
• Michael E. Tipping, Chris M. Bishop. Probabilistic Principal Component Analysis, 1999. (TR in 1997)
• Sam Roweis. EM Algorithms for PCA and SPCA. NIPS 1998.
• Singliar, Hauskrecht. Noisy-OR Component Analysis and its Application to Link Analysis. 2006.
• Jaakkola, Tommi S., and Michael I. Jordan. "Variational probabilistic inference and the QMR-DT network." Journal of Artificial Intelligence Research 10 (1999): 291–322.