Machine Learning 4771
Instructor: Tony Jebara, Columbia University
Topic 9
• Continuous Probability Models
• Gaussian Distribution
• Maximum Likelihood Gaussian
• Sampling from a Gaussian
Continuous Probability Models
• Probabilities can have both discrete & continuous variables
• We will discuss: 1) discrete probability tables, 2) continuous probability distributions
• Most popular continuous distribution = the Gaussian
[Figure: two discrete probability tables — a coin with p(x=T) = 0.4, p(x=H) = 0.6, and a die with p(x=1) = … = p(x=5) = 0.1, p(x=6) = 0.5]
Gaussian Distribution
• Recall the 1-dimensional Gaussian: the mean parameter µ translates the Gaussian left and right
• It can also have a variance parameter σ², which widens or narrows the Gaussian:

p(x \mid \mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x-\mu)^2\right)

p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)

Note: \int_{-\infty}^{\infty} p(x)\,dx = 1
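A quick numerical check (my own illustration, not from the course materials) that the density integrates to 1, as the note above states:

```python
import numpy as np
from scipy.integrate import quad

# 1-d Gaussian density with arbitrary (assumed) parameters mu and sigma
mu, sigma = 1.5, 0.7
p = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

total, _ = quad(p, -np.inf, np.inf)  # numerically integrate over the real line
print(total)                         # ~ 1.0
```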
Multivariate Gaussian
• The Gaussian extends to D dimensions
• The mean parameter µ is now a vector; it translates the bump
• The covariance matrix Σ stretches and rotates the bump
• The mean is any real vector; the maximum (mode) and the expectation of the density are both at µ
• The variance parameter is now the matrix Σ
• The covariance matrix is symmetric and positive definite
• We need the matrix inverse (inv), determinant (det), and trace (trace) operators

p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right), \qquad x \in \mathbb{R}^D,\; \mu \in \mathbb{R}^D,\; \Sigma \in \mathbb{R}^{D \times D}
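A minimal NumPy sketch of this density; the function name gaussian_pdf is my own, not from the course code:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x | mu, Sigma) for x in R^D."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))  # (2*pi)^(D/2) |Sigma|^(1/2)
    quad = diff @ np.linalg.inv(Sigma) @ diff                      # (x-mu)^T Sigma^{-1} (x-mu)
    return np.exp(-0.5 * quad) / norm
```

In practice the inverse and determinant are usually computed via a Cholesky factorization for numerical stability; the direct form above simply mirrors the slide's formula.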
Multivariate Gaussian
• Spherical covariance, e.g. in three dimensions:

\Sigma = \sigma^2 I = \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}

• Diagonal covariance: the dimensions of x are independent, so the density is a product of D one-dimensional Gaussians:

\Sigma = \begin{bmatrix} \sigma(1)^2 & 0 & \cdots & 0 \\ 0 & \sigma(2)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma(D)^2 \end{bmatrix}

p(x \mid \mu, \Sigma) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}\,\sigma(d)} \exp\left(-\frac{(x(d)-\mu(d))^2}{2\sigma(d)^2}\right)
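A short check (my own illustration) that a diagonal-covariance Gaussian factorizes into a product of 1-d Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
mu = rng.normal(size=D)
sigma = rng.uniform(0.5, 2.0, size=D)    # per-dimension standard deviations sigma(d)
x = rng.normal(size=D)

# Product of D one-dimensional Gaussian densities
p_prod = np.prod(np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))

# Full multivariate formula with Sigma = diag(sigma(d)^2)
Sigma = np.diag(sigma**2)
diff = x - mu
p_full = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
         ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))

assert np.isclose(p_prod, p_full)        # the two agree
```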
Max Likelihood Gaussian
• Have IID samples as vectors, i = 1..N:  X = \{x_1, x_2, \ldots, x_N\}
• How do we recover the mean and covariance parameters?
• Standard approach: maximum likelihood (IID)
• Maximize the probability of the data given the model (the likelihood):

p(X \mid \theta) = p(x_1, x_2, \ldots, x_N \mid \theta)
= \prod_{i=1}^{N} p(x_i \mid \mu_i, \Sigma_i) \quad \text{(independent Gaussian samples)}
= \prod_{i=1}^{N} p(x_i \mid \mu, \Sigma) \quad \text{(identically distributed)}

• Instead, work with the maximum of the log-likelihood:

\sum_{i=1}^{N} \log p(x_i \mid \mu, \Sigma) = \sum_{i=1}^{N} \log\left[ \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right) \right]
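A sketch of this log-likelihood in NumPy (the function name and the array layout, with rows of X as samples, are my own assumptions):

```python
import numpy as np

def gaussian_log_likelihood(X, mu, Sigma):
    """Sum of log p(x_i | mu, Sigma) over the rows of X (shape N x D)."""
    N, D = X.shape
    diff = X - mu                                 # row i holds (x_i - mu)
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.einsum('nd,de,ne->n', diff, Sigma_inv, diff)  # (x_i-mu)^T Sigma^{-1} (x_i-mu)
    _, logdet = np.linalg.slogdet(Sigma)          # log |Sigma|, computed stably
    return -0.5 * (N * D * np.log(2 * np.pi) + N * logdet + quad.sum())
```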
Max Likelihood Gaussian
• Max over µ (see Jordan Ch. 12): using \frac{\partial\, x^T x}{\partial x} = 2x^T, we get the sample mean:

\frac{\partial}{\partial \mu}\left( \sum_{i=1}^{N} \log\left[ \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right) \right] \right) = 0

\frac{\partial}{\partial \mu}\left( \sum_{i=1}^{N} \left[ -\frac{D}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] \right) = 0

\sum_{i=1}^{N} (x_i-\mu)^T \Sigma^{-1} = 0 \quad\Rightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i

• For Σ we need the trace operator and several of its properties:

\operatorname{tr}(A) = \operatorname{tr}(A^T) = \sum_{d=1}^{D} A(d,d)

\operatorname{tr}(AB) = \operatorname{tr}(BA), \qquad \operatorname{tr}(BAB^{-1}) = \operatorname{tr}(A), \qquad \operatorname{tr}(x x^T A) = \operatorname{tr}(x^T A x) = x^T A x
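The trace identities are easy to verify numerically; a small sanity check of my own:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))
x = rng.normal(size=4)

assert np.isclose(np.trace(A), np.trace(A.T))                        # tr(A) = tr(A^T)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))                  # tr(AB) = tr(BA)
assert np.isclose(np.trace(B @ A @ np.linalg.inv(B)), np.trace(A))   # tr(BAB^-1) = tr(A)
assert np.isclose(np.trace(np.outer(x, x) @ A), x @ A @ x)           # tr(xx^T A) = x^T A x
```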
Max Likelihood Gaussian
• Likelihood rewritten in trace notation:

l = \sum_{i=1}^{N} \left[ -\frac{D}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right]
= -\frac{ND}{2}\log 2\pi + \frac{N}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^{N} \operatorname{tr}\left[(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right]
= -\frac{ND}{2}\log 2\pi + \frac{N}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^{N} \operatorname{tr}\left[(x_i-\mu)(x_i-\mu)^T \Sigma^{-1}\right]
= -\frac{ND}{2}\log 2\pi + \frac{N}{2}\log|A| - \frac{1}{2}\sum_{i=1}^{N} \operatorname{tr}\left[(x_i-\mu)(x_i-\mu)^T A\right]

• Max over A = Σ⁻¹, using the properties:

\frac{\partial \log|A|}{\partial A} = (A^{-1})^T, \qquad \frac{\partial \operatorname{tr}[BA]}{\partial A} = B^T

\frac{\partial l}{\partial A} = -0 + \frac{N}{2}(A^{-1})^T - \frac{1}{2}\sum_{i=1}^{N} \left[(x_i-\mu)(x_i-\mu)^T\right]^T = \frac{N}{2}\Sigma - \frac{1}{2}\sum_{i=1}^{N} (x_i-\mu)(x_i-\mu)^T

• Setting this to zero gives the sample covariance:

\frac{\partial l}{\partial A} = 0 \quad\Rightarrow\quad \hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)(x_i-\mu)^T
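Putting the two maximum-likelihood results together, a minimal sketch (function name mine) of the sample mean and sample covariance:

```python
import numpy as np

def fit_gaussian_ml(X):
    """ML Gaussian fit from an N x D data matrix: sample mean, (1/N) sample covariance."""
    N = X.shape[0]
    mu = X.mean(axis=0)               # mu_hat = (1/N) sum_i x_i
    diff = X - mu
    Sigma = diff.T @ diff / N         # Sigma_hat = (1/N) sum_i (x_i-mu)(x_i-mu)^T
    return mu, Sigma
```

Note the 1/N (not 1/(N-1)) normalizer: the maximum-likelihood estimate is the biased sample covariance, matching np.cov(X, rowvar=False, bias=True).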
Sampling & Max Likelihood
[Diagram: data, i = 1..N IID observations {x1,…,xN}, and generative model parameters Θ; generative learning (maximum likelihood) maps the data to the parameters, while sampling maps the parameters back to data.]
Sampling from a Gaussian
• Fit a Gaussian to data; how is this generative?
• Sampling! Generating discrete data is easy.
• Assume we can do uniform sampling, i.e. rand between (0,1):
  if 0.00 <= rand < 0.73 get A
  if 0.73 <= rand < 0.83 get B
  if 0.83 <= rand < 1.00 get C
• What are we doing? Summing up the probability density function (PDF) to get the cumulative distribution function (CDF)
• For a 1d Gaussian, integrate the probability density function to get the cumulative distribution function; the integral is like summing many discrete bars
[Figure: PDF bars of heights 0.73, 0.10, 0.17 and the corresponding CDF levels 0.73, 0.83, 1.00]
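A minimal sketch of this inverse-CDF sampler for the discrete A/B/C example (variable names mine):

```python
import numpy as np

rng = np.random.default_rng(2)
outcomes = np.array(['A', 'B', 'C'])
pdf = np.array([0.73, 0.10, 0.17])
cdf = np.cumsum(pdf)                          # [0.73, 0.83, 1.00]

u = rng.uniform(size=10_000)                  # uniform samples in (0,1)
samples = outcomes[np.searchsorted(cdf, u)]   # first bin whose CDF value exceeds u
print({o: float(np.mean(samples == o)) for o in outcomes})  # ~0.73, ~0.10, ~0.17
```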
Sampling from a Gaussian
• Integrate the 1d Gaussian to get its CDF:

p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right)

F(x) = \int_{-\infty}^{x} p(t)\,dt = \frac{1}{2}\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right) + \frac{1}{2}

• Sample from the uniform: u \sim \text{uniform}(0,1)
• Compute the mapping: x = F^{-1}(u) = \sqrt{2}\,\operatorname{erfinv}(2u-1)
• This is a Gaussian sample: x \sim N(x \mid 0, 1)
• For a D-dimensional Gaussian N(x|0,I), concatenate D independent samples:

x = \left[x(1)\;\ldots\;x(D)\right]^T, \qquad p(x \mid 0, I) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x(d)^2\right)

• For N(z|µ,Σ), add the mean and multiply by a root of the covariance:

z = \Sigma^{1/2} x + \mu, \qquad z \sim p(z \mid \mu, \Sigma)

• Example code: gendata.m
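Since gendata.m itself is not reproduced here, the following is a sketch in the same spirit (all names mine), using SciPy's erfinv for the inverse CDF and a Cholesky factor as one valid root of the covariance:

```python
import numpy as np
from scipy.special import erfinv

rng = np.random.default_rng(3)

def sample_standard_normal(n, d):
    """Inverse-CDF sampling: x = sqrt(2) * erfinv(2u - 1) with u ~ uniform(0,1)."""
    u = rng.uniform(size=(n, d))
    return np.sqrt(2) * erfinv(2 * u - 1)      # each entry ~ N(0, 1)

def sample_gaussian(n, mu, Sigma):
    """z = L x + mu where L L^T = Sigma, so mean(z) = mu and cov(z) = Sigma."""
    x = sample_standard_normal(n, len(mu))
    L = np.linalg.cholesky(Sigma)
    return x @ L.T + mu

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
Z = sample_gaussian(5000, mu, Sigma)
print(Z.mean(axis=0))   # ~ mu
print(np.cov(Z.T))      # ~ Sigma
```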