A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2011

Reading • Chapter 5 (continued)

Lecture 8 • Key points in probability • CLT • CLT examples
Prior vs Likelihood [figure from Box & Tiao]

“Learning” in Bayesian Estimation [figure from Box & Tiao]
I. Mutually exclusive events:

If a occurs then b cannot have occurred.

Let c = a + b + ⋯ (“or”, the same as a ∪ b):

P(c) = P{a or b occurred} = P(a) + P(b)

Let d = a · b · ⋯ (“and”, the same as a ∩ b):

P(d) = P{a and b occurred} = 0 if mutually exclusive

II. Non-mutually exclusive events:

P(c) = P{a or b} = P(a) + P(b) − P(ab)

III. Independent events:

P(ab) ≡ P(a) P(b)
Examples

I. Mutually exclusive events

Toss a coin once: there are 2 possible outcomes, H and T.

H and T are mutually exclusive.

H and T are not independent, because P(HT) = P{heads and tails} = 0, so P(HT) ≠ P(H) P(T).
II. Independent events

Toss a coin twice (one experiment). The outcomes of the experiment are:

1st toss   2nd toss
H1         H2
H1         T2
T1         H2
T1         T2

Events might be defined as:

H1H2 = event that H on 1st toss, H on 2nd
H1T2 = event that H on 1st toss, T on 2nd
T1H2 = event that T on 1st toss, H on 2nd
T1T2 = event that T on 1st toss, T on 2nd

Note that P(H1H2) = P(H1) P(H2) [as long as the coin is not altered between tosses].
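The factorization for independent tosses can be checked numerically. This is a minimal sketch with NumPy; the sample size and seed are arbitrary choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200_000   # arbitrary, just needs to be large

# Each experiment is two fair, independent coin tosses (1 = heads, 0 = tails).
toss1 = rng.integers(0, 2, n_trials)
toss2 = rng.integers(0, 2, n_trials)

p_h1 = np.mean(toss1 == 1)                      # estimate of P(H1)
p_h2 = np.mean(toss2 == 1)                      # estimate of P(H2)
p_h1h2 = np.mean((toss1 == 1) & (toss2 == 1))   # estimate of P(H1 H2)

print(p_h1h2, p_h1 * p_h2)  # both near 0.25: P(H1 H2) = P(H1) P(H2)
```

Contrast this with the single-toss case above, where P(HT) = 0 regardless of P(H) P(T).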
Random Variables

Of interest to us is the distribution of probability along the real number axis. Random variables assign numbers to events or, more precisely, map the event space into a set of numbers:

a ↦ X(a)
event ↦ number

The definition of probability translates directly over to the numbers assigned by random variables. The following properties hold for a real random variable:

1. Let {X ≤ x} = the event that the r.v. X is less than or equal to the number x, defined for all x. [This defines all intervals on the real number line to be events.]

2. The events {X = +∞} and {X = −∞} have zero probability. (Otherwise, moments would generally not be finite.)

Distribution function (CDF = Cumulative Distribution Function):

F_X(x) = P{X ≤ x} ≡ P{all events A : X(A) ≤ x}
Properties:

1. F_X(x) is a monotonically increasing function of x.
2. F_X(−∞) = 0, F_X(+∞) = 1.
3. P{x1 < X ≤ x2} = F_X(x2) − F_X(x1).

Probability Density Function (pdf):

f_X(x) = dF_X(x)/dx

Properties:

1. f_X(x) dx = P{x < X ≤ x + dx}
2. ∫_{−∞}^{+∞} dx f_X(x) = F_X(+∞) − F_X(−∞) = 1 − 0 = 1

Measures such as the mean, median, and mode are all localization measures; other quantities are needed to measure the width and asymmetry of the PDF, etc.
Continuous r.v.'s: the derivative of F_X(x) exists ∀x.

Discrete random variables: use delta functions to write the pdf in pseudo-continuous form, e.g. coin flipping. Let

X = { +1  heads
    { −1  tails

Then

f_X(x) = ½ [δ(x + 1) + δ(x − 1)]

F_X(x) = ½ [U(x + 1) + U(x − 1)]

where U is the unit step function.
Functions of a random variable:

The function Y = g(X) is a random variable that is a mapping from some event A to a number Y according to:

Y(A) = g[X(A)]

Theorem: if Y = g(X), then the pdf of Y is

f_Y(y) = Σ_{j=1}^{n} f_X(x_j) / |dg(x)/dx|_{x=x_j} ,

where x_j, j = 1, …, n are the solutions of y = g(x), i.e. x_j = g^{−1}(y). Note that the normalization property (unit area) is conserved.

This is one of the most important equations!
Example

Y = g(X) = aX + b

dg/dx = a

g^{−1}(y) = x_1 = (y − b)/a

f_Y(y) = f_X(x_1) / |dg(x_1)/dx| = |a|^{−1} f_X((y − b)/a).

Comment about “natural” random number generators.
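The linear-transformation result can be checked by sampling. A sketch assuming a standard Gaussian for f_X (an illustrative choice; the theorem holds for any pdf), with a and b arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Y = aX + b has pdf |a|^-1 f_X((y - b)/a); for a standard Gaussian X
# that is a Gaussian with mean b and standard deviation |a|.
a, b = 2.0, 3.0
x = rng.standard_normal(500_000)
y = a * x + b

mean_y, std_y = y.mean(), y.std()
print(mean_y, std_y)  # near b = 3.0 and |a| = 2.0
```

Note the |a| in the Jacobian factor: for a < 0 the mapping reverses orientation, but the density must stay non-negative.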
To check: show that ∫_{−∞}^{+∞} dy f_Y(y) = 1.

Example: Suppose we want to transform from a uniform distribution to an exponential distribution. We want f_Y(y) = exp(−y) for y ≥ 0. A typical random number generator gives f_X(x) with

f_X(x) = { 1,  0 ≤ x < 1;
         { 0,  otherwise.

Choose y = g(x) = −ln(x). Then:

dg/dx = −1/x

x_1 = g^{−1}(y) = e^{−y}

f_Y(y) = f_X[exp(−y)] / |−1/x_1| = x_1 = e^{−y}.
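This transform is easy to verify empirically. A minimal sketch (sample size arbitrary): the unit-rate exponential has mean 1 and variance 1, so both sample statistics should come out near 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Uniform deviates on [0, 1); y = -ln(x) should then follow
# f_Y(y) = exp(-y), which has mean 1 and variance 1.
x = rng.random(500_000)
y = -np.log(x)

print(y.mean(), y.var())  # both near 1.0
```

This is exactly the inverse-CDF ("inverse transform") sampling trick used by many exponential random number generators.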
Moments

We will always use angular brackets ⟨ ⟩ to denote an average over an ensemble (integrating over an ensemble); time averages and other sample averages will be denoted differently.

Expected value of a random variable:

E(X) ≡ ⟨X⟩ = ∫ dx x f_X(x)

⟨ ⟩ denotes expectation w.r.t. the PDF of X.

Arbitrary power:

⟨X^n⟩ = ∫ dx x^n f_X(x)

Variance:

σ_x² = ⟨X²⟩ − ⟨X⟩²

Function of a random variable: if Y = g(X) and ⟨Y⟩ ≡ ∫ dy y f_Y(y), then it is easy to show that

⟨Y⟩ = ∫ dx g(x) f_X(x).

Proof:

⟨Y⟩ ≡ ∫ dy y f_Y(y) = ∫ dy y Σ_{j=1}^{n} f_X[x_j(y)] / |dg[x_j(y)]/dx|

A change of variable, dy = (dg/dx) dx, yields the result.

Factoid: Poisson events in time have spacings that are exponentially distributed.

Central moments:

μ_n = ⟨(X − ⟨X⟩)^n⟩
Moment Tests:

Moments are useful for testing hypotheses, such as whether a given PDF is consistent with the data. E.g., consistency with a Gaussian PDF:

kurtosis: k = μ_4/μ_2² − 3 = 0

skewness: γ = μ_3/μ_2^{3/2} = 0

k > 0 ⇒ 4th moment proportionately larger ⇒ larger-amplitude tails than a Gaussian and less probable values near the mean.
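A sketch of the moment test in NumPy, using a Gaussian sample and an exponential sample for contrast (the exponential is an illustrative choice; its skewness is 2 and excess kurtosis is 6):

```python
import numpy as np

rng = np.random.default_rng(3)

def skewness(x):
    d = x - x.mean()
    mu2, mu3 = np.mean(d**2), np.mean(d**3)
    return mu3 / mu2**1.5

def excess_kurtosis(x):
    d = x - x.mean()
    mu2, mu4 = np.mean(d**2), np.mean(d**4)
    return mu4 / mu2**2 - 3.0

gauss = rng.standard_normal(200_000)
expo = rng.exponential(size=200_000)   # skewness 2, excess kurtosis 6

sk_g, ku_g = skewness(gauss), excess_kurtosis(gauss)
sk_e, ku_e = skewness(expo), excess_kurtosis(expo)
print(sk_g, ku_g)  # both near 0, consistent with Gaussian
print(sk_e, ku_e)  # near 2 and 6, clearly non-Gaussian
```

In practice one also needs the sampling variance of these estimators to decide whether a nonzero value is significant.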
Uses of Moments:

Often one wants to infer the underlying PDF of an observable, e.g. because determination of the PDF is tantamount to understanding the underlying physics of some process. Two approaches are:

1. Construct a histogram and compare its shape with a theoretical shape.
2. Determine some of the moments (usually low-order) and compare.

Suppose the data are {x_j, j = 1, …, N}.

1. One could form bins of size Δx and count how many x_j fall into each bin. If N is large enough that n_k = the number of points in the k-th bin is also large, then a reasonably good estimate of the PDF can be made. (But beware of the dependence of the results on the choice of binning.)

2. However, often N is too small, or one would like to determine only basic information about the shape of the distribution (is it symmetric?), or to determine the mean and variance of the PDF, or to test whether the data are consistent with a given PDF (hypothesis testing).

Some typical situations are:

i) assume the data were drawn from a Gaussian parent PDF; estimate the mean and σ of the Gaussian [parameter estimation]

ii) test whether the data are consistent with a Gaussian PDF [moment test]

Note that if the r.v. is zero mean, then the PDF is determined solely by one parameter, σ:

f_X(x) = (1/√(2πσ²)) e^{−x²/2σ²}

The moments are

⟨x^n⟩ = 1 · 3 ⋯ (n − 1) σ^n ≡ (n − 1)!! σ^n   for n even
⟨x^n⟩ = 0                                      for n odd

Therefore the n = 2 moment, the first non-zero moment, determines all other moments. This statement carries over to multi-dimensional Gaussian processes:

Any moment of order higher than two is redundant ... or can be used as a test for Gaussianity.
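The (n − 1)!! σ^n formula for even Gaussian moments can be spot-checked by sampling. A sketch with an arbitrary illustrative σ:

```python
import numpy as np
from math import prod

rng = np.random.default_rng(4)
sigma = 1.5                       # illustrative value
x = rng.normal(0.0, sigma, 1_000_000)

def double_factorial(m):
    # m!! = m (m-2) (m-4) ... ; by convention (-1)!! = 1
    return prod(range(m, 0, -2)) if m > 0 else 1

# <x^n> = (n-1)!! sigma^n for even n, and 0 for odd n.
moments = {n: np.mean(x**n) for n in (2, 3, 4)}
theory = {n: (double_factorial(n - 1) * sigma**n if n % 2 == 0 else 0.0)
          for n in (2, 3, 4)}
print(moments, theory)
```

For σ = 1.5 this gives ⟨x²⟩ ≈ σ² = 2.25 and ⟨x⁴⟩ ≈ 3σ⁴ ≈ 15.19, with ⟨x³⟩ consistent with zero.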
Characteristic Function:

Of considerable use is the characteristic function

Φ_X(ω) ≡ ⟨e^{iωx}⟩ = ∫ dx f_X(x) e^{iωx}.

If we know Φ_X(ω), then we know all there is to know about the PDF, because

f_X(x) = (1/2π) ∫ dω Φ_X(ω) e^{−iωx}

is the inversion formula.

If we know all the moments of f_X(x), then we can also completely characterize f_X(x). Similarly, the characteristic function is a moment-generating function:

Φ_X(ω) = ⟨e^{iωX}⟩ = ⟨ Σ_{n=0}^{∞} (iωX)^n / n! ⟩ = Σ_{n=0}^{∞} [(iω)^n / n!] ⟨X^n⟩

because the expectation of a sum = the sum of the expectations.

By taking derivatives we can show that

∂Φ/∂ω |_{ω=0} = i ⟨X⟩

∂²Φ/∂ω² |_{ω=0} = i² ⟨X²⟩

∂^nΦ/∂ω^n |_{ω=0} = i^n ⟨X^n⟩

or

⟨X^n⟩ = i^{−n} ∂^nΦ/∂ω^n |_{ω=0} = (−i)^n ∂^nΦ/∂ω^n |_{ω=0}    (Price's theorem)

Characteristic functions are useful for deriving PDFs of combinations of r.v.'s as well as for deriving particular moments.
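The derivative-to-moment relation can be checked numerically for a case where Φ is known in closed form. A sketch using the zero-mean Gaussian, whose characteristic function is Φ(ω) = exp(−σ²ω²/2), with the derivative taken by finite difference:

```python
import numpy as np

# Characteristic function of a zero-mean Gaussian with std sigma.
sigma = 2.0
phi = lambda w: np.exp(-0.5 * sigma**2 * w**2)

# <X^2> = (-i)^2 * d^2 Phi/d omega^2 at omega = 0, estimated by a
# central finite difference:
h = 1e-4
d2phi = (phi(h) - 2.0 * phi(0.0) + phi(-h)) / h**2
second_moment = -d2phi               # (-i)^2 = -1

print(second_moment)  # near sigma^2 = 4.0
```

The sign flip from (−i)² is exactly why the second derivative of Φ at the origin is minus the second moment.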
Joint Random Variables

Let X and Y be two random variables with their associated sample spaces. The actual events associated with X and Y may or may not be independent (e.g. throwing a die may map into X; choosing colored marbles from a hat may map into Y). The relationship of the events will be described by the joint distribution function of X and Y:

F_XY(x, y) ≡ P{X ≤ x, Y ≤ y}

and the joint probability density function is

f_XY(x, y) ≡ ∂²F_XY(x, y) / ∂x ∂y    (a two-dimensional PDF)

Note that the one-dimensional PDF of X, for example, is obtained by integrating the joint PDF over all y:

f_X(x) = ∫ dy f_XY(x, y)

which corresponds to asking what the PDF of X is given that the certain event for Y occurs.

Example: flip two coins a and b. Let heads = 1, tails = 0. Define 2 r.v.'s: X = a + b, Y = a. With these definitions, X and Y are statistically dependent.

Characteristic function of joint r.v.'s:

Φ_XY(ω_1, ω_2) = ⟨e^{i(ω_1 X + ω_2 Y)}⟩ = ∫∫ dx dy e^{i(ω_1 x + ω_2 y)} f_XY(x, y).

For X, Y independent this factors:

Φ_XY(ω_1, ω_2) = [ ∫ dx f_X(x) e^{iω_1 x} ] [ ∫ dy f_Y(y) e^{iω_2 y} ] ≡ Φ_X(ω_1) Φ_Y(ω_2).

Example for independent r.v.'s: flip two coins a and b. As before, heads = 1 and tails = 0; let X = a, Y = b (X and Y are independent).
Independent random variables

Two random variables are said to be independent if the events mapping into one r.v. are independent of those mapping into the other. In this case, joint probabilities are factorable, so that

F_XY(x, y) = F_X(x) F_Y(y)

f_XY(x, y) = f_X(x) f_Y(y).

Such factorization is plausible if one considers moments of independent r.v.'s:

⟨X^n Y^m⟩ = ⟨X^n⟩ ⟨Y^m⟩

which follows from

⟨X^n Y^m⟩ ≡ ∫∫ dx dy x^n y^m f_XY(x, y) = [ ∫ dx x^n f_X(x) ] [ ∫ dy y^m f_Y(y) ].
Convolution theorem for sums of independent RVs

If Z = X + Y, where X, Y are independent random variables, then the PDF of Z is the convolution of the PDFs of X and Y:

f_Z(z) = f_X ∗ f_Y = ∫ dx f_X(x) f_Y(z − x) = ∫ dx f_X(z − x) f_Y(x).

Proof: By definition,

f_Z(z) = (d/dz) F_Z(z)

Consider

F_Z(z) = P{Z ≤ z}

Now, as before, this is

F_Z(z) = P{X + Y ≤ z} = P{Y ≤ z − X}.

To evaluate this, first evaluate the probability P{Y ≤ z − x}, where x is just a number:

P{Y ≤ z − x} ≡ F_Y(z − x) ≡ ∫_{−∞}^{z−x} dy f_Y(y)

But P{Y ≤ z − X} must account for all values of x, so we integrate over x and weight by the probability of x:

P{Y ≤ z − X} = ∫_{−∞}^{+∞} dx f_X(x) ∫_{−∞}^{z−x} dy f_Y(y)

That is, P{Y ≤ z − X} is the expected value of F_Y(z − X). Differentiating with respect to z using the Leibniz integration formula,

(d/db) ∫_a^{g(b)} dξ h(ξ) = h(g(b)) dg(b)/db,

we obtain the convolution result.
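A quick numerical sketch of the theorem, using two uniform deviates as an illustrative choice: the convolution of two unit boxcars is the triangular pdf on [0, 2], peaking at f_Z(1) = 1, and the variances add to 1/12 + 1/12 = 1/6.

```python
import numpy as np

rng = np.random.default_rng(5)

# Z = X + Y with X and Y independent and uniform on [0, 1).
m = 1_000_000
z = rng.random(m) + rng.random(m)

# Histogram estimate of f_Z; the triangular pdf peaks at z = 1.
counts, edges = np.histogram(z, bins=100, range=(0.0, 2.0), density=True)
peak = counts[49:51].mean()   # density in the two bins straddling z = 1

print(z.mean(), z.var(), peak)  # near 1, 1/6 ≈ 0.167, and 1
```

The triangular shape is the first step of the CLT example below: convolving more boxcars rounds the pdf toward a Gaussian.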
Characteristic function of Z = X + Y

For X, Y independent we have

f_Z = f_X ∗ f_Y  ⇒  Φ_Z(ω) = ⟨e^{iωz}⟩ = Φ_X(ω) Φ_Y(ω)

Variance of Z: if the variances of X and Y are σ_X², σ_Y², then the variance of Z is σ_Z² = σ_X² + σ_Y².

Assume X and Y, and hence Z, are zero-mean r.v.'s. Then we have

σ_X² = ⟨x²⟩ = i^{−2} ∂²Φ_X/∂ω² |_{ω=0} = −∂²Φ_X/∂ω² |_{ω=0}

σ_Y² = ⟨y²⟩ = −∂²Φ_Y/∂ω² |_{ω=0}

Using Price's theorem (the moment relation above):

σ_Z² = ⟨Z²⟩ = −∂²Φ_Z/∂ω² |_{ω=0}

     = −(∂²/∂ω²) [Φ_X(ω) Φ_Y(ω)] |_{ω=0}

     = −(∂/∂ω) [Φ_X ∂Φ_Y/∂ω + Φ_Y ∂Φ_X/∂ω] |_{ω=0}

     = −[Φ_X ∂²Φ_Y/∂ω² + Φ_Y ∂²Φ_X/∂ω² + 2 (∂Φ_X/∂ω)(∂Φ_Y/∂ω)] |_{ω=0}.

At ω = 0 we have Φ_X = Φ_Y = 1, and the first derivatives vanish because the means are zero. We have “discovered” that variances add (independent variables only):

σ_Z² = σ_X² + σ_Y².
Multivariate random variables: N dimensional

The results for the bivariate case are easily extrapolated. If

Z = X_1 + X_2 + … + X_N = Σ_{j=1}^{N} X_j

where the X_j are all independent r.v.'s, then

f_Z(z) = f_{X_1} ∗ f_{X_2} ∗ … ∗ f_{X_N}

and

Φ_Z(ω) = Π_{j=1}^{N} Φ_{X_j}(ω)

and

σ_Z² = Σ_{j=1}^{N} σ_{X_j}².
Central Limit Theorem:

Let

Z_N = (1/√N) Σ_{j=1}^{N} X_j

where the X_j are independent r.v.'s with means and variances

μ_j ≡ ⟨X_j⟩

σ_j² = ⟨X_j²⟩ − ⟨X_j⟩²

and the PDFs of the X_j's are almost arbitrary. The restrictions on the distribution of each X_j are:

i) σ_j² > m > 0, m = constant

ii) ⟨|X|^n⟩ < M = constant for n > 2

In the limit N → ∞, Z_N becomes a Gaussian random variable with mean

⟨Z_N⟩ = (1/√N) Σ_{j=1}^{N} μ_j

and variance

σ_Z² = (1/N) Σ_{j=1}^{N} σ_j².

Example: suppose the X_j are all uniformly distributed between ±½, so f_X(x) = Π(x), the unit boxcar, whose Fourier transform is sin(ω/2)/(ω/2).
Thus the characteristic function of each term is

Φ_j(ω) = ⟨e^{iωx_j}⟩ = sin(ω/2)/(ω/2)

[Figure: [sin(ω/2)/(ω/2)]^N for N = 2, 3, …, compared with the Gaussian limit e^{−ω²}.]

From the convolution results we have

Φ_{√N Z_N}(ω) = [ sin(ω/2) / (ω/2) ]^N

From the transformation of random variables we have that

f_{Z_N}(x) = √N f_{√N Z_N}(√N x)

and by the scaling theorem for Fourier transforms

Φ_{Z_N}(ω) = Φ_{√N Z_N}(ω/√N) = [ sin(ω/(2√N)) / (ω/(2√N)) ]^N.
Now

lim_{N→∞} Φ_{Z_N}(ω) = e^{−σ_Z² ω²/2}

or

f_{Z_N}(x) = (1/√(2π σ_Z²)) e^{−x²/2σ_Z²}.

Consistency with this limiting form can be seen by expanding Φ_{Z_N} for small ω:

Φ_{Z_N}(ω) ≈ [ (ω/(2√N) − (1/3!)(ω/(2√N))³) / (ω/(2√N)) ]^N = [1 − (1/6)(ω/(2√N))²]^N ≈ 1 − ω²/24,

which is identical to the small-ω expansion of exp(−ω²σ_Z²/2) with σ_Z² = 1/12.
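The uniform-sum example can be simulated directly. A sketch (N and the trial count are arbitrary choices): σ_Z² should be 1/12 for every N, and the excess kurtosis should shrink toward the Gaussian value of 0 (roughly as −1.2/N for uniform terms).

```python
import numpy as np

rng = np.random.default_rng(6)

# Z_N = (1/sqrt(N)) * sum of N independent uniforms on [-1/2, 1/2).
N, trials = 30, 200_000
x = rng.random((trials, N)) - 0.5
z = x.sum(axis=1) / np.sqrt(N)

mu2 = np.mean(z**2)
mu4 = np.mean(z**4)
excess_kurt = mu4 / mu2**2 - 3.0   # 0 for a Gaussian; about -1.2/N here

print(z.var(), excess_kurt)  # near 1/12 ≈ 0.0833 and near 0
```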
CLT Comments
• A sum of Gaussian RVs is automatically a Gaussian RV (can show using characteristic functions)
• Convergence to a Gaussian form depends on the actual PDFs of the terms in the sum and their relative variances
• Exceptions exist!
CLT: Example of a PDF that does not work

The Cauchy distribution and its characteristic function are

f_X(x) = (α/π) · 1/(α² + x²),    Φ(ω) = e^{−α|ω|}

Now

Z_N = (1/√N) Σ_{j=1}^{N} x_j

has the characteristic function

Φ_N(ω) = e^{−Nα|ω|/√N} = e^{−√N α|ω|}

By inspection, the exponential will not converge to a Gaussian. Instead, the sum of N Cauchy RVs is itself a Cauchy RV.

Is the Cauchy distribution a legitimate PDF? It integrates to unity, so yes, but it fails the conditions of the CLT: the variance diverges,

⟨X²⟩ = ∫_{−∞}^{+∞} dx x² (α/π) · 1/(α² + x²) → ∞.
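The failure is easy to see in simulation. A sketch comparing the spread of sample means for small and large N (interquartile range is used instead of the standard deviation, which does not exist for Cauchy deviates):

```python
import numpy as np

rng = np.random.default_rng(7)

# The mean of N Cauchy deviates is itself Cauchy with the same scale,
# so the spread of sample means does NOT shrink like 1/sqrt(N).
def iqr_of_means(n, trials=20_000):
    means = rng.standard_cauchy((trials, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

iqr_small = iqr_of_means(10)
iqr_large = iqr_of_means(1000)
print(iqr_small, iqr_large)  # both near 2, the IQR of a unit Cauchy
```

For a CLT-obeying PDF the second number would be ten times smaller than the first.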
A CLT Problem

• Consider a set of N quantities {a_i, i = 1, …, N} that are i.i.d. (independently and identically distributed) with zero mean:

⟨a_i⟩ = 0,    ⟨a_i a_j⟩ = σ_a² δ_ij

• We are interested in the cross correlation between all unique pairs:

C_N = (1/N_X) Σ_{i<j} a_i a_j = (1/N_X) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} a_i a_j,    N_X = N(N−1)/2

• What do you expect ⟨C_N⟩ to be?
• What do you expect the PDF of C_N to be?

Notes:

• The number of independent quantities (random variables) is N.
• The sum C_N has terms that are products of i.i.d. variables.
• Any given term in the sum is statistically independent of some (but not all) of the other terms.
• The PDF of a product is different from the PDF of the individual factors.
• In the limit N ≫ 1 there should be many independent terms in the sum.
• N = 2: one can show that the PDF is symmetric (odd-order moments = 0).
• N > 2: one can show that the third moment ≠ 0. What gives?
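The questions above can be explored numerically. A sketch assuming Gaussian a_i (an illustrative choice; the notes leave the distribution general), using the identity Σ_{i<j} a_i a_j = [(Σ_i a_i)² − Σ_i a_i²]/2:

```python
import numpy as np

rng = np.random.default_rng(8)

# C_N = (1/N_X) * sum_{i<j} a_i a_j with N_X = N(N-1)/2.
def sample_CN(N, trials):
    a = rng.standard_normal((trials, N))   # zero-mean i.i.d. (Gaussian assumed)
    s = a.sum(axis=1)
    pair_sum = (s**2 - (a**2).sum(axis=1)) / 2.0
    return pair_sum / (N * (N - 1) / 2.0)

c = sample_CN(8, 200_000)
mu3 = np.mean((c - c.mean())**3)
print(c.mean(), mu3)  # mean near 0; third central moment small but positive
```

The positive third moment for N > 2 comes from "triangle" index patterns (i,j), (j,k), (k,i), which require three distinct indices and therefore vanish for N = 2.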
Conditional Probabilities & Bayes' Theorem

We have considered P(α), the probability of an event α. Conditional probabilities also obey the axioms of probability: P(β|α) is the probability of the event β given that the event α has occurred,

P(β|α) ≡ P(βα) / P(α)

Recast the axioms as:

I. P(β|α) ≥ 0

II. P(β|α) + P(β̄|α) = 1

III. P(βα|γ) = P(β|γ) P(α|βγ) = P(α|γ) P(β|αγ)

How does this relate to experiments? Use the product rule:

P(α|βγ) = P(α|γ) P(β|αγ) / P(β|γ)

or, letting M = model (or hypothesis), D = data, and I = background information (assumptions),

P(M|DI) = P(M|I) P(D|MI) / P(D|I)

Terms:

prior: P(M|I)

sampling distribution for D: P(D|MI) (also called the likelihood for M)

prior predictive for D: P(D|I) (also called the global likelihood for M, or the evidence for M)
Particular strengths of the Bayesian method include:

1. One must often be explicit about what is assumed in I, the background information.

2. In assessing models, we get a PDF for the parameters rather than just point estimates.

3. Occam's razor (simpler models win, all else being equal) is easily invoked when comparing models. We may have many different models M_i that we wish to compare. Form the odds ratio from the posterior PDFs P(M_i|DI):

O_{i,j} ≡ P(M_i|DI) / P(M_j|DI) = [ P(M_i|I) / P(M_j|I) ] · [ P(D|M_i I) / P(D|M_j I) ].
Example

Data: {k_i}, i = 1, …, n, drawn from a Poisson process.

Poisson PDF: P_k = λ^k e^{−λ} / k!

Want: the mean of the process.

Frequentist approach:

We need an estimator for the mean; consider the likelihood

f(λ) = Π_{i=1}^{n} P(k_i) = (1 / Π_{i=1}^{n} k_i!) λ^{Σ_i k_i} e^{−nλ}.

Maximizing,

df/dλ = 0 = f(λ) [ −n + λ^{−1} Σ_{i=1}^{n} k_i ],

we obtain as an estimator for the mean

k̄ = (1/n) Σ_{i=1}^{n} k_i.
Bayesian approach:

Likelihood (as before):

P(D|MI) = Π_{i=1}^{n} P(k_i) = (1 / Π_{i=1}^{n} k_i!) λ^{Σ_i k_i} e^{−nλ}.

Prior:

P(M|I) = P(λ|I). Assume a flat prior, P(λ|I) ∝ U(λ), where U is the unit step function (λ ≥ 0).

Prior predictive:

P(D|I) ∝ ∫_{−∞}^{+∞} dλ U(λ) P(D|MI) = Γ(nk̄ + 1) / (n^{nk̄+1} Π_{i=1}^{n} k_i!),

where k̄ = (1/n) Σ_i k_i. Combining all of the above, we find

P(λ|{k_i} I) = [ n^{nk̄+1} / Γ(nk̄ + 1) ] λ^{nk̄} e^{−nλ} U(λ),

which is a Gamma distribution in λ. Note that rather than getting a point estimate for the mean, we get a PDF for its value. For hypothesis testing, this is much more useful than a point estimate.
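The posterior can be checked numerically. A sketch in which lam_true = 4.2 and n = 50 are illustrative choices (not values from the notes): the Gamma posterior with shape nk̄ + 1 and rate n should integrate to unity and have mean k̄ + 1/n, close to but not identical to the frequentist estimator k̄.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(9)

# Simulated Poisson data; lam_true and n are illustrative assumptions.
lam_true, n = 4.2, 50
k = rng.poisson(lam_true, n)
kbar = k.mean()

# Flat-prior posterior: a Gamma density with shape a = n*kbar + 1, rate n:
#   p(lam | {k_i}) = n^a / Gamma(a) * lam^(a-1) * exp(-n*lam)
a = n * kbar + 1.0
lam = np.linspace(1e-6, 12.0, 200_001)
log_p = a * np.log(n) - lgamma(a) + (a - 1.0) * np.log(lam) - n * lam
p = np.exp(log_p)                      # log form avoids overflow

dlam = lam[1] - lam[0]
norm = p.sum() * dlam                  # should be ~1 (proper normalization)
post_mean = (lam * p).sum() * dlam     # should be kbar + 1/n

print(norm, post_mean, kbar)
```

With the full posterior in hand, credible intervals for λ follow by integrating p, something a point estimate alone cannot provide.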