Probability Basics for Machine Learning CSC411 Shenlong Wang* Friday, January 15, 2015 *Based on many others’ slides and resources from Wikipedia



Page 1

Probability Basics for Machine Learning

CSC411, Shenlong Wang*

Friday, January 15, 2015

*Based on many others' slides and resources from Wikipedia

Page 2

Outline  

• Motivation
• Notation, definitions, laws
• Exponential family distributions
  – e.g., the Normal distribution

• Parameter estimation
• Conjugate prior

Page 3

Why Represent Uncertainty?

• The world is full of uncertainty
  – "Is there a person in this image?"
  – "What will the weather be like today?"
  – "Will I like this movie?"

• We're trying to build systems that understand and (possibly) interact with the real world

• We often can't prove something is true, but we can still ask how likely different outcomes are, or ask for the most likely explanation

[Image: Amazon logo]

Page 4

Why Use Probability to Represent Uncertainty?

• Write down simple, reasonable criteria that you'd want from a system of uncertainty (common-sense stuff), and you always get probability:
  – "Probability theory is nothing but common sense reduced to calculation." — Pierre Laplace, 1812

• Cox Axioms (Cox 1946); see Bishop, Section 1.2.3
• We will restrict ourselves to a relatively informal discussion of probability theory.

Page 5

Notation

• A random variable X represents outcomes or states of the world.

• We will write p(x) to mean Probability(X = x)
• Sample space: the space of all possible outcomes (may be discrete, continuous, or mixed)

• p(x) is the probability mass (density) function
  – Assigns a number to each point in sample space
  – Non-negative, sums (integrates) to 1
  – Intuitively: how often does x occur, how much do we believe in x

Page 6

Joint Probability Distribution
• Prob(X=x, Y=y)
  – "Probability of X=x and Y=y"
  – p(x, y)

Conditional Probability Distribution
• Prob(X=x | Y=y)
  – "Probability of X=x given Y=y"
  – p(x | y) = p(x, y) / p(y)

 

Page 7

The Rules of Probability

• Sum Rule (marginalization/summing out):

  p(x) = Σ_y p(x, y)

  p(x_1) = Σ_{x_2} Σ_{x_3} … Σ_{x_N} p(x_1, x_2, …, x_N)

• Product/Chain Rule:

  p(x, y) = p(y | x) p(x)

  p(x_1, …, x_N) = p(x_1) p(x_2 | x_1) … p(x_N | x_1, …, x_{N−1})

Page 8

Bayes' Rule

•  One  of  the  most  important  formulas  in  probability  theory  

• This gives us a way of "reversing" conditional probabilities

p(x | y) = p(y | x) p(x) / p(y) = p(y | x) p(x) / Σ_{x'} p(y | x') p(x')
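A minimal sketch of Bayes' rule in code (my addition, with made-up prior and likelihood values), showing how a conditional is "reversed":

```python
import numpy as np

# Hypothetical binary state x (e.g., "person in image") and binary observation y.
p_x = np.array([0.3, 0.7])                 # prior p(x): p(x=0), p(x=1)
p_y_given_x = np.array([[0.9, 0.1],        # row x=0: p(y=0|x=0), p(y=1|x=0)
                        [0.2, 0.8]])       # row x=1: p(y=0|x=1), p(y=1|x=1)

y_observed = 1
# Bayes' rule: p(x | y) = p(y | x) p(x) / sum_x' p(y | x') p(x')
numerator = p_y_given_x[:, y_observed] * p_x
p_x_given_y = numerator / numerator.sum()

print("posterior p(x | y=1) =", p_x_given_y)
```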

Page 9

Independence  

• Two random variables are said to be independent iff their joint distribution factors:

• Two random variables are conditionally independent given a third if they are independent after conditioning on the third:

X ⊥ Y  ⇔  p(x, y) = p(y | x) p(x) = p(x | y) p(y) = p(x) p(y)

X ⊥ Y | Z  ⇔  p(x, y | z) = p(y | x, z) p(x | z) = p(y | z) p(x | z)   ∀z

Page 10

Continuous Random Variables

• Outcomes are real values. Probability density functions define distributions.
  – E.g., the Gaussian density:

    p(x | µ, σ) = 1/(√(2π) σ) · exp{ −(x − µ)² / (2σ²) }

• Continuous joint distributions: replace sums with integrals, and everything holds.
  – E.g., marginalization and conditional probability:

    P(x, z) = ∫ P(x, y, z) dy = ∫ P(x, z | y) P(y) dy

Page 11

Summarizing Probability Distributions

• It is often useful to give summaries of distributions without defining the whole distribution (e.g., mean and variance)

• Mean:

  E[x] = ∫ x · p(x) dx

• Variance:

  var(x) = ∫ (x − E[x])² · p(x) dx = E[x²] − E[x]²

• Nth moment (about c):

  µ_n = ∫ (x − c)^n · p(x) dx
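As an added illustration (not from the slides), Monte Carlo estimates of these summaries for a Gaussian; the sample size and parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)   # samples from N(mu=2, sigma=1.5)

mean = x.mean()                           # estimates E[x] = 2.0
var = ((x - mean) ** 2).mean()            # estimates var(x) = sigma^2 = 2.25
var_alt = (x ** 2).mean() - mean ** 2     # same quantity via E[x^2] - E[x]^2
third_central = ((x - mean) ** 3).mean()  # 3rd moment about c = E[x] (close to 0 for a Gaussian)

print(mean, var, var_alt, third_central)
```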

Page 12

Exponential Family

• Family of probability distributions
• Many of the standard distributions belong to this family
  – Bernoulli, binomial/multinomial, Poisson, Normal (Gaussian), beta/Dirichlet, …

• Share many important properties
  – e.g., they have a conjugate prior (we'll get to that later; important for Bayesian statistics)

• First – let's see some examples

Page 13

Definition

• The exponential family of distributions over x, given parameter η (eta), is the set of distributions of the form

• x – scalar or vector, discrete or continuous
• η – the 'natural parameters'
• u(x) – some function of x (the sufficient statistic)
• g(η) – normalizer

p(x | η) = h(x) g(η) exp{ηᵀ u(x)}

g(η) ∫ h(x) exp{ηᵀ u(x)} dx = 1

Page 14

Example 1: Bernoulli

• Binary random variable: X ∈ {0, 1}
• p(heads) = µ, with µ ∈ [0, 1]
• E.g., a coin toss

p(x | µ) = µ^x (1 − µ)^(1−x)

Page 15

Example 1: Bernoulli

p(x | µ) = µ^x (1 − µ)^(1−x)
         = exp{ x ln µ + (1 − x) ln(1 − µ) }
         = (1 − µ) exp{ x ln( µ / (1 − µ) ) }

Comparing with the general form p(x | η) = h(x) g(η) exp{ηᵀ u(x)}:

h(x) = 1
u(x) = x
η = ln( µ / (1 − µ) )   ⇒   µ = σ(η) = 1 / (1 + e^(−η))
g(η) = σ(−η) = 1 − σ(η)

p(x | η) = σ(−η) exp(η x)
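A quick numerical check, added here rather than taken from the slides, that the exponential-family form σ(−η) exp(ηx) agrees with µ^x (1 − µ)^(1−x); the value of µ is arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mu = 0.3                          # p(heads), chosen arbitrarily
eta = np.log(mu / (1 - mu))       # natural parameter

for x in (0, 1):
    standard = mu**x * (1 - mu)**(1 - x)
    exp_family = sigmoid(-eta) * np.exp(eta * x)   # h(x)=1, u(x)=x, g(eta)=sigmoid(-eta)
    assert np.isclose(standard, exp_family)
    print(x, standard, exp_family)
```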

Page 16

Example 2: Multinomial

• p(value k) = µ_k

• For a single observation – a die toss
  – Sometimes called Categorical

• For multiple observations – integer counts on N trials
  – e.g., Prob(1 came out 3 times, 2 came out once, …, 6 came out 7 times, if I tossed a die 20 times)

µ_k ∈ [0, 1],   Σ_{k=1}^M µ_k = 1

P(x_1, …, x_M | µ) = ( N! / (x_1! … x_M!) ) Π_{k=1}^M µ_k^(x_k),   where   Σ_{k=1}^M x_k = N

Page 17

Example 2: Multinomial (1 observation)

P(x_1, …, x_M | µ) = Π_{k=1}^M µ_k^(x_k) = exp{ Σ_{k=1}^M x_k ln µ_k }

Comparing with the general form p(x | η) = h(x) g(η) exp{ηᵀ u(x)}:

h(x) = 1
u(x) = x
η_k = ln µ_k

p(x | η) = exp(ηᵀ x)

The parameters are not independent, due to the constraint of summing to 1; there is a slightly more involved notation to address that, see Bishop 2.4.

Page 18

Example 3: Normal (Gaussian) Distribution

•  Gaussian  (Normal)  

p(x | µ, σ) = 1/(√(2π) σ) · exp{ −(x − µ)² / (2σ²) }

Page 19

Example 3: Normal (Gaussian) Distribution

 

• µ is the mean
• σ² is the variance
• Can verify these by computing integrals, e.g.,

∫_{−∞}^{+∞} x · 1/(√(2π) σ) · exp{ −(x − µ)² / (2σ²) } dx = µ
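To verify this numerically, here is a small sketch added for this write-up, assuming SciPy is available; µ and σ are arbitrary:

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.7, 0.8   # arbitrary parameters

def integrand(x):
    # x times the Gaussian density N(x | mu, sigma^2)
    return x * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

value, abs_err = quad(integrand, -np.inf, np.inf)
print(value)   # approximately 1.7 = mu
```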

Page 20

Example 3: Normal (Gaussian) Distribution

• Multivariate Gaussian

 

p(x | µ, Σ) = |2π Σ|^(−1/2) exp{ −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) }

Page 21

Example 3: Normal (Gaussian) Distribution

• Multivariate Gaussian

• x is now a vector
• µ is the mean vector
• Σ is the covariance matrix

p(x | µ, Σ) = |2π Σ|^(−1/2) exp{ −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) }

Page 22

Important Properties of Gaussians

• All marginals of a Gaussian are again Gaussian
• Any conditional of a Gaussian is Gaussian
• The product of two Gaussian densities is again Gaussian (up to normalization)
• Even the sum of two independent Gaussian random variables is Gaussian

Page 23

Exponential Family Representation

p(x | µ, σ²) = 1/(√(2πσ²)) · exp{ −(x − µ)² / (2σ²) }
             = 1/(√(2π)) · (1/σ) · exp{ −x²/(2σ²) + µx/σ² − µ²/(2σ²) }
             = (−2η₂)^(1/2) exp( η₁² / (4η₂) ) · 1/(√(2π)) · exp{ η₁ x + η₂ x² }

Matching the general form p(x | η) = h(x) g(η) exp{ηᵀ u(x)}:

h(x) = 1/√(2π)
u(x) = (x, x²)ᵀ
η = (η₁, η₂)ᵀ = (µ/σ², −1/(2σ²))ᵀ
g(η) = (−2η₂)^(1/2) exp( η₁² / (4η₂) )
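A small added check (not in the slides) that the natural-parameter form above reproduces the usual Gaussian density; parameter values are arbitrary:

```python
import numpy as np

mu, sigma = 0.5, 2.0
x = np.linspace(-5, 5, 11)

# Natural parameters and components of the exponential-family form
eta1 = mu / sigma**2
eta2 = -1.0 / (2 * sigma**2)
h = 1.0 / np.sqrt(2 * np.pi)                            # h(x), constant here
g = np.sqrt(-2 * eta2) * np.exp(eta1**2 / (4 * eta2))   # g(eta)

p_exp_family = h * g * np.exp(eta1 * x + eta2 * x**2)
p_standard = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

assert np.allclose(p_exp_family, p_standard)
```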

Page 24

Example: Maximum Likelihood For a 1D Gaussian

• Suppose we are given a data set of samples of a Gaussian random variable X, D = {x_1, …, x_N}, and told that the variance of the data is σ²

     What  is  our  best  guess  of  µ?    

*Need to assume the data are independent and identically distributed (i.i.d.)

[Figure: the samples x_1, x_2, …, x_N on a number line]

Page 25

Example: Maximum Likelihood For a 1D Gaussian

What is our best guess of µ?
• We can write down the likelihood function:

• We want to choose the µ that maximizes this expression
  – Take the log, then basic calculus: differentiate w.r.t. µ, set the derivative to 0, and solve for µ to get the sample mean

 

p(d | µ) = Π_{i=1}^N p(x_i | µ, σ) = Π_{i=1}^N 1/(√(2π) σ) · exp{ −(x_i − µ)² / (2σ²) }

µ_ML = (1/N) Σ_{i=1}^N x_i
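A minimal added sketch of this estimate: with σ assumed known, the sample mean attains the highest log-likelihood among nearby candidates. The data here are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                                  # assumed known
true_mu = 3.0
data = rng.normal(true_mu, sigma, size=500)  # the data set D = {x_1, ..., x_N}

mu_ml = data.mean()                          # closed-form ML estimate (sample mean)

def log_likelihood(mu):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * ((data - mu) / sigma) ** 2)

# The sample mean scores at least as well as nearby candidate values of mu.
for candidate in (mu_ml - 0.5, mu_ml, mu_ml + 0.5):
    print(candidate, log_likelihood(candidate))
```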

Page 26

Example: Maximum Likelihood For a 1D Gaussian

[Figure: the samples x_1, x_2, …, x_N with the maximum-likelihood Gaussian fit, centered at µ_ML with width σ_ML]

Page 27

ML estimation of model parameters for the Exponential Family

p(D | η) = Π_{n=1}^N p(x_n | η) = ( Π_{n=1}^N h(x_n) ) g(η)^N exp{ ηᵀ Σ_{n=1}^N u(x_n) }

Take the derivative with respect to η, set it to 0, and solve for ∇ ln g(η):

−∇ ln g(η_ML) = (1/N) Σ_{n=1}^N u(x_n)

• This can in principle be solved to get an estimate for η.
• The solution for the ML estimator depends on the data only through the sum over u(x_n), which is therefore called the sufficient statistic.
• It is what we need to store in order to estimate the parameters.
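To make the sufficient-statistic point concrete, here is an added Bernoulli sketch (u(x) = x, as in Example 1): the ML estimate depends on the data only through the count Σ_n x_n, so storing that single number together with N is enough:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.binomial(1, 0.3, size=1000)   # simulated coin flips from a Bernoulli(0.3)

# Sufficient statistic: the sum of u(x_n) = x_n over the data set.
sufficient_stat = data.sum()
N = data.size

mu_ml = sufficient_stat / N              # ML estimate of mu uses only (sum, N)
eta_ml = np.log(mu_ml / (1 - mu_ml))     # corresponding natural parameter
print(mu_ml, eta_ml)
```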

Page 28

Bayesian Probabilities

p(θ | d) = p(d | θ) p(θ) / p(d)

• p(d | θ) is the likelihood function
• p(θ) is the prior probability of (or our prior belief over) θ – our beliefs over which models are likely or not before seeing any data
• p(d) = ∫ p(d | θ) p(θ) dθ is the normalization constant or partition function
• p(θ | d) is the posterior distribution – a readjustment of our prior beliefs in the face of data

Page 29

Example: Bayesian Inference For a 1D Gaussian

• Suppose we have a prior belief that the mean of some random variable X is µ_0, and the variance of our belief is σ_0²

• We are then given a data set of samples of X, d = {x_1, …, x_N}, and somehow know that the variance of the data is σ²

What is the posterior distribution over (our belief about the value of) µ?

Page 30

Example: Bayesian Inference For a 1D Gaussian

[Figure: the observed samples x_1, x_2, …, x_N on a number line]

Page 31

Example: Bayesian Inference For a 1D Gaussian

[Figure: the samples x_1, …, x_N together with the prior belief over µ, a Gaussian centered at µ_0 with width σ_0]

Page 32

Example: Bayesian Inference For a 1D Gaussian

• Remember from earlier:

  p(µ | d) = p(d | µ) p(µ) / p(d)

• p(d | µ) is the likelihood function:

  p(d | µ) = Π_{i=1}^N p(x_i | µ, σ) = Π_{i=1}^N 1/(√(2π) σ) · exp{ −(x_i − µ)² / (2σ²) }

• p(µ) is the prior probability of (or our prior belief over) µ:

  p(µ | µ_0, σ_0) = 1/(√(2π) σ_0) · exp{ −(µ − µ_0)² / (2σ_0²) }

Page 33

Example: Bayesian Inference For a 1D Gaussian

p(µ | D) ∝ p(D | µ) p(µ)

p(µ | D) = Normal(µ | µ_N, σ_N),   where

µ_N = ( σ² / (N σ_0² + σ²) ) µ_0 + ( N σ_0² / (N σ_0² + σ²) ) µ_ML

1/σ_N² = 1/σ_0² + N/σ²
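An added sketch computing the posterior parameters µ_N and σ_N from the formulas above; the prior settings and simulated data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

mu0, sigma0 = 0.0, 1.0          # prior belief about the mean
sigma = 2.0                     # known data standard deviation
data = rng.normal(1.5, sigma, size=50)

N = data.size
mu_ml = data.mean()

# Posterior over mu is Normal(mu | mu_N, sigma_N):
mu_N = (sigma**2 / (N * sigma0**2 + sigma**2)) * mu0 \
     + (N * sigma0**2 / (N * sigma0**2 + sigma**2)) * mu_ml
sigma_N = np.sqrt(1.0 / (1.0 / sigma0**2 + N / sigma**2))

print(mu_N, sigma_N)   # mu_N lies between mu0 and mu_ml; sigma_N shrinks as N grows
```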

Page 34

Example: Bayesian Inference For a 1D Gaussian

[Figure: the samples x_1, …, x_N with the prior belief, a Gaussian centered at µ_0 with width σ_0]

Page 35

Example: Bayesian Inference For a 1D Gaussian

[Figure: the samples x_1, …, x_N with the prior belief (µ_0, σ_0) and the maximum-likelihood fit (µ_ML, σ_ML)]

Page 36

Example: Bayesian Inference For a 1D Gaussian

[Figure: the samples x_1, …, x_N with the prior belief, the maximum-likelihood fit, and the posterior distribution centered at µ_N with width σ_N]

Page 37

Image from xkcd.com

Page 38

Conjugate Priors

• Notice in the Gaussian parameter estimation example that the functional form of the posterior was that of the prior (Gaussian)

•  Priors  that  lead  to  that  form  are  called  ‘conjugate  priors’  

• For any member of the exponential family there exists a conjugate prior, which can be written as

• Multiply by the likelihood to obtain a posterior (up to normalization) of the form

• Notice the addition to the sufficient statistic
• ν is the effective number of pseudo-observations

p(η | χ, ν) = f(χ, ν) g(η)^ν exp{ ν ηᵀ χ }

p(η | D, χ, ν) ∝ g(η)^(ν+N) exp{ ηᵀ ( Σ_{n=1}^N u(x_n) + ν χ ) }

Page 39

Conjugate Priors – Examples

• Beta for Bernoulli/binomial
• Dirichlet for categorical/multinomial
• Normal for the mean of a Normal
• And many more...
  – Conjugate Prior Table:

• http://en.wikipedia.org/wiki/Conjugate_prior
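As an added illustration of the first entry in the list (a Beta prior is conjugate to the Bernoulli likelihood), a small conjugate-update sketch; the pseudo-counts and simulated data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

# Beta(a, b) prior over the Bernoulli parameter mu; a, b act like pseudo-observations.
a, b = 2.0, 2.0
data = rng.binomial(1, 0.7, size=100)   # simulated coin flips

# Conjugacy: the posterior is again a Beta, with the observed counts added to the pseudo-counts.
a_post = a + data.sum()
b_post = b + (data.size - data.sum())

posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post, posterior_mean)
```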