Machine Learning 4771, Columbia University
Instructor: Tony Jebara
Topic 9 lecture notes (jebara/4771/notes/class9x.pdf, 2014-10-21)


Page 1

Tony Jebara, Columbia University

Machine Learning 4771

Instructor: Tony Jebara

Page 2

Topic 9

• Continuous Probability Models

• Gaussian Distribution

• Maximum Likelihood Gaussian

• Sampling from a Gaussian

Page 3

Continuous Probability Models
• Probabilities can have both discrete & continuous variables
• We will discuss:
  1) discrete probability tables
  2) continuous probability distributions
• Most popular continuous distribution = Gaussian

Example discrete probability tables (a biased die and a biased coin):
  x=1: 0.1, x=2: 0.1, x=3: 0.1, x=4: 0.1, x=5: 0.1, x=6: 0.5
  x=T: 0.4, x=H: 0.6
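As a small illustration (my own sketch, not course code), a discrete probability table like the two above can be stored as a plain lookup structure and checked for normalization:

```python
# A minimal sketch (not from the slides): store a discrete probability table
# as a plain dict and check that it is normalized.
die = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}   # biased die table above
coin = {"T": 0.4, "H": 0.6}                               # biased coin table above

for name, table in [("die", die), ("coin", coin)]:
    total = sum(table.values())
    assert abs(total - 1.0) < 1e-9, f"{name} table must sum to 1"
    print(name, table)
```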

Page 4

Gaussian Distribution
• Recall the 1-dimensional Gaussian: the mean parameter µ translates the Gaussian left & right
• Can also have a variance parameter σ², which widens or narrows the Gaussian

p(x \mid \mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(x-\mu)^2\right)

p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)

Note: \int_{x=-\infty}^{\infty} p(x)\,dx = 1
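A quick NumPy check (an illustration, not from the slides) of these properties: the density integrates to one, its peak sits at µ, and a larger σ² lowers and widens the bump:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, var=1.0):
    """1-d Gaussian density p(x | mu, sigma^2) from the formula above."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

x = np.linspace(-6, 6, 2001)
for mu, var in [(0.0, 1.0), (2.0, 1.0), (0.0, 4.0)]:
    p = gaussian_pdf(x, mu, var)
    # Normalization ~1, peak at x = mu, and a larger variance lowers/widens the bump.
    print(f"mu={mu}, var={var}: area={np.trapz(p, x):.4f}, "
          f"argmax={x[np.argmax(p)]:+.2f}, peak={p.max():.3f}")
```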

Page 5

Multivariate Gaussian
• Gaussian can extend to D dimensions
• The mean parameter µ is now a vector; it translates the bump
• The covariance matrix Σ stretches and rotates the bump
• Mean is any real vector
• Max and expectation = µ
• Variance parameter is now the Σ matrix
• Covariance matrix is positive definite
• Covariance matrix is symmetric
• Need matrix inverse (inv)
• Need matrix determinant (det)
• Need matrix trace operator (trace)

p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)

x \in \mathbb{R}^D, \quad \mu \in \mathbb{R}^D, \quad \Sigma \in \mathbb{R}^{D \times D}
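A minimal sketch (mine, not course code) that evaluates this density directly from the formula, using the matrix inverse and determinant the slide mentions; the example µ and Σ are arbitrary:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate Gaussian density written directly from the slide's formula,
    using the matrix inverse and determinant it calls for."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * quad) / norm)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])        # symmetric, positive definite
print(mvn_pdf(np.array([0.5, 1.5]), mu, Sigma))
```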

Page 6

Multivariate Gaussian
• Spherical covariance:

\Sigma = \sigma^2 I = \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}

• Diagonal covariance: the dimensions of x are independent, so the density is a product of multiple 1d Gaussians:

\Sigma = \begin{bmatrix} \sigma(1)^2 & 0 & \cdots & 0 \\ 0 & \sigma(2)^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma(D)^2 \end{bmatrix}

p(x \mid \mu, \Sigma) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\,\sigma(d)^2}} \exp\left(-\frac{(x(d)-\mu(d))^2}{2\,\sigma(d)^2}\right)
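A small numeric check (illustration only, with made-up numbers) that the joint density with a diagonal Σ factorizes into the product of per-dimension 1d Gaussians:

```python
import numpy as np

# Numeric check (illustration with made-up numbers): with a diagonal covariance
# the joint density factorizes into a product of per-dimension 1-d Gaussians.
rng = np.random.default_rng(0)
D = 4
mu = rng.normal(size=D)
var = rng.uniform(0.5, 2.0, size=D)          # sigma(d)^2 for each dimension
Sigma = np.diag(var)
x = rng.normal(size=D)

diff = x - mu
joint = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
        / ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))
per_dim = np.prod(np.exp(-diff ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))
print(joint, per_dim)                        # the two numbers agree
```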

Page 7

Max Likelihood Gaussian
• Have IID samples as vectors i = 1..N:

X = \{x_1, x_2, \ldots, x_N\}

• How do we recover the mean and covariance parameters?
• Standard approach: Maximum Likelihood (IID)
• Maximize probability of data given model (likelihood):

p(X \mid \theta) = p(x_1, x_2, \ldots, x_N \mid \theta)
= \prod_{i=1}^{N} p(x_i \mid \mu_i, \Sigma_i) \quad \text{(independent Gaussian samples)}
= \prod_{i=1}^{N} p(x_i \mid \mu, \Sigma) \quad \text{(identically distributed)}

• Instead, work with the maximum of the log-likelihood:

\sum_{i=1}^{N} \log p(x_i \mid \mu, \Sigma) = \sum_{i=1}^{N} \log \left[ \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right) \right]
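A short NumPy sketch (mine, not the course's) of this log-likelihood, evaluated on randomly generated data for concreteness:

```python
import numpy as np

def gaussian_log_likelihood(X, mu, Sigma):
    """Sum over IID rows x_i of log p(x_i | mu, Sigma), as written above."""
    N, D = X.shape
    diff = X - mu                                          # N x D
    Sinv = np.linalg.inv(Sigma)
    quad = np.einsum('nd,de,ne->n', diff, Sinv, diff)      # (x_i-mu)^T Sinv (x_i-mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return np.sum(-0.5 * D * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # made-up data
print(gaussian_log_likelihood(X, np.zeros(3), np.eye(3)))
```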

Page 8

Max Likelihood Gaussian
• Max over µ: set the derivative of the log-likelihood to zero (see Jordan Ch. 12) and get the sample mean…

\frac{\partial}{\partial \mu} \left( \sum_{i=1}^{N} \log \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right) \right) = 0

\frac{\partial}{\partial \mu} \left( \sum_{i=1}^{N} \left[ -\frac{D}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right] \right) = 0

\sum_{i=1}^{N} (x_i-\mu)^T \Sigma^{-1} = 0 \quad\Rightarrow\quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i

• For Σ we need the trace operator and several properties:

\frac{\partial\, x^T x}{\partial x} = 2x^T

\mathrm{tr}(A) = \mathrm{tr}(A^T) = \sum_{d=1}^{D} A(d,d)

\mathrm{tr}(AB) = \mathrm{tr}(BA)

\mathrm{tr}(BAB^{-1}) = \mathrm{tr}(A)

\mathrm{tr}(x x^T A) = \mathrm{tr}(x^T A x) = x^T A x
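These identities are easy to sanity-check numerically; the snippet below (an illustration with random matrices, not course code) verifies each one:

```python
import numpy as np

# Quick numeric sanity check of the trace properties above (random matrices).
rng = np.random.default_rng(1)
D = 3
A = rng.normal(size=(D, D))
B = rng.normal(size=(D, D))
x = rng.normal(size=D)

print(np.isclose(np.trace(A), np.trace(A.T)))                        # tr(A) = tr(A^T)
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))                  # tr(AB) = tr(BA)
print(np.isclose(np.trace(B @ A @ np.linalg.inv(B)), np.trace(A)))   # tr(BAB^-1) = tr(A)
print(np.isclose(np.trace(np.outer(x, x) @ A), x @ A @ x))           # tr(xx^T A) = x^T A x
```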

Page 9

Max Likelihood Gaussian
• Likelihood rewritten in trace notation:

l = \sum_{i=1}^{N} \left[ -\frac{D}{2}\log 2\pi - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x_i-\mu)^T \Sigma^{-1} (x_i-\mu) \right]
= -\frac{ND}{2}\log 2\pi + \frac{N}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^{N} \mathrm{tr}\left[(x_i-\mu)^T \Sigma^{-1} (x_i-\mu)\right]
= -\frac{ND}{2}\log 2\pi + \frac{N}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^{N} \mathrm{tr}\left[(x_i-\mu)(x_i-\mu)^T \Sigma^{-1}\right]
= -\frac{ND}{2}\log 2\pi + \frac{N}{2}\log|A| - \frac{1}{2}\sum_{i=1}^{N} \mathrm{tr}\left[(x_i-\mu)(x_i-\mu)^T A\right]

• Max over A = Σ⁻¹, using the properties:

\frac{\partial \log|A|}{\partial A} = (A^{-1})^T
\qquad
\frac{\partial\, \mathrm{tr}[BA]}{\partial A} = B^T

\frac{\partial l}{\partial A} = -0 + \frac{N}{2}(A^{-1})^T - \frac{1}{2}\sum_{i=1}^{N} \left[(x_i-\mu)(x_i-\mu)^T\right]^T
= \frac{N}{2}\Sigma - \frac{1}{2}\sum_{i=1}^{N} (x_i-\mu)(x_i-\mu)^T

• Get sample covariance:

\frac{\partial l}{\partial A} = 0 \quad\Rightarrow\quad \Sigma = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)(x_i-\mu)^T
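In code, the two ML estimates just derived are a one-liner each; this sketch (made-up data, not course code) computes them and checks the covariance against NumPy's 1/N-normalized estimator:

```python
import numpy as np

# Sketch (made-up data): the ML estimates derived above are the sample mean
# and the 1/N sample covariance.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.0],
                                          [0.8, 0.5]]) + np.array([1.0, -2.0])

N = X.shape[0]
mu_hat = X.mean(axis=0)                      # mu = (1/N) sum_i x_i
diff = X - mu_hat
Sigma_hat = diff.T @ diff / N                # Sigma = (1/N) sum_i (x_i-mu)(x_i-mu)^T

print(mu_hat)
print(Sigma_hat)
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))   # matches NumPy's 1/N covariance
```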

Page 10

Sampling & Max Likelihood

[Diagram: a generative model with parameters Θ and data of i = 1..N IID observations {x_1, …, x_N}. Generative learning (Maximum Likelihood) goes from the data to the parameters Θ; sampling goes from the parameters Θ back to data.]

Page 11

Sampling from a Gaussian
• Fit Gaussian to data, how is this Generative?

Page 12

Sampling from a Gaussian
• Fit Gaussian to data, how is this Generative?
• Sampling! Generating discrete data is easy:
• Assume we can do uniform sampling, i.e. rand between (0,1):
  if 0.00 <= rand < 0.73 get A
  if 0.73 <= rand < 0.83 get B
  if 0.83 <= rand < 1.00 get C
• What are we doing?

[Bar chart: P(A) = 0.73, P(B) = 0.10, P(C) = 0.17]
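In code, the recipe above amounts to accumulating the table into a CDF and seeing which interval a uniform draw falls into. A sketch (my own, using the example's 0.73 / 0.10 / 0.17 table):

```python
import numpy as np

# Sketch of the slide's recipe: turn the table into a CDF and
# see which interval each uniform draw falls into.
probs = {"A": 0.73, "B": 0.10, "C": 0.17}
labels = list(probs)
cdf = np.cumsum(list(probs.values()))        # [0.73, 0.83, 1.00]

rng = np.random.default_rng(3)
u = rng.uniform(0.0, 1.0, size=10000)
samples = [labels[np.searchsorted(cdf, ui, side="right")] for ui in u]

for k in labels:                             # empirical frequencies ~ the table
    print(k, samples.count(k) / len(samples))
```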

Page 13

Sampling from a Gaussian
• What are we doing? Sum up the Probability Density Function (PDF) to get the Cumulative Distribution Function (CDF)
• For a 1d Gaussian, integrate the Probability Density Function to get the Cumulative Distribution Function; the integral is like summing many discrete bars

[Bar charts: discrete PDF P(A) = 0.73, P(B) = 0.10, P(C) = 0.17 and its CDF values 0.73, 0.83, 1.00]

Page 14

Sampling from a Gaussian
• Integrate the 1d Gaussian to get the CDF:

p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x^2\right)

F(x) = \int_{-\infty}^{x} p(t)\,dt = \frac{1}{2}\,\mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right) + \frac{1}{2}

• If we sample from the uniform, we get: u \sim \mathrm{Uniform}(0,1)
• Compute the mapping:

x = F^{-1}(u) = \sqrt{2}\,\mathrm{erfinv}(2u - 1)

• This is a Gaussian sample: x \sim N(x \mid 0, 1)
• For a D-dimensional Gaussian N(x \mid 0, I), concatenate samples:

x = \left[x(1)\ \ldots\ x(D)\right]^T, \qquad p(x \mid 0, I) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}x(d)^2\right)

• For N(z \mid \mu, \Sigma), add the mean & multiply by the root covariance:

z = \Sigma^{1/2} x + \mu, \qquad z \sim N(z \mid \mu, \Sigma)

• Example code: gendata.m
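gendata.m itself is not reproduced here; a rough NumPy/SciPy equivalent of the recipe on this slide (inverse-CDF sampling via erfinv, then z = Σ^{1/2} x + µ) might look like the following sketch, with an arbitrary µ and Σ chosen for illustration:

```python
import numpy as np
from scipy.special import erfinv     # same erfinv as in the slide's mapping
from scipy.linalg import sqrtm       # one way to get a matrix square root of Sigma

rng = np.random.default_rng(4)

def sample_standard_normal(n, d):
    """x = sqrt(2) * erfinv(2u - 1): map Uniform(0,1) draws to N(0,1) draws."""
    u = rng.uniform(0.0, 1.0, size=(n, d))
    return np.sqrt(2.0) * erfinv(2.0 * u - 1.0)

def sample_gaussian(n, mu, Sigma):
    """z = Sigma^{1/2} x + mu: turn standard-normal vectors into N(mu, Sigma) samples."""
    x = sample_standard_normal(n, mu.shape[0])
    root = np.real(sqrtm(Sigma))     # Sigma is symmetric PSD, so the root is real
    return x @ root.T + mu

mu = np.array([1.0, -2.0])           # arbitrary example parameters
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
Z = sample_gaussian(5000, mu, Sigma)
print(Z.mean(axis=0))                # ~ mu
print(np.cov(Z.T, bias=True))        # ~ Sigma
```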