A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2011

Reading • Chapter 5 (continued)

Lecture 8 • Key points in probability • CLT • CLT examples
Prior vs Likelihood [figure from Box & Tiao]

“Learning” in Bayesian Estimation [figure from Box & Tiao]
I. Mutually exclusive events:

If a occurs then b cannot have occurred.

Let c = a + b + ⋯ (“or”, the same as a ∪ b):

P(c) = P{a or b occurred} = P(a) + P(b)

Let d = a · b · ⋯ (“and”, the same as a ∩ b):

P(d) = P{a and b occurred} = 0 if mutually exclusive

II. Non-mutually exclusive events:

P(c) = P{a or b} = P(a) + P(b) − P(ab)

III. Independent events:

P(ab) ≡ P(a) P(b)
Examples

I. Mutually exclusive events

Toss a coin once: there are 2 possible outcomes, H and T.

H and T are mutually exclusive.

H and T are not independent, because P(HT) = P{heads and tails} = 0, so P(HT) ≠ P(H) P(T).
II. Independent events

Toss a coin twice (one experiment). The outcomes of the experiment are:

1st toss   2nd toss
H1         H2
H1         T2
T1         H2
T1         T2

Events might be defined as:

H1H2 = event that H on 1st toss, H on 2nd
H1T2 = event that H on 1st toss, T on 2nd
T1H2 = event that T on 1st toss, H on 2nd
T1T2 = event that T on 1st toss, T on 2nd

Note that P(H1H2) = P(H1) P(H2) [as long as the coin is not altered between tosses].
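The factorization for independent tosses can be checked numerically. This is a minimal sketch with NumPy; the sample size and seed are arbitrary choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200_000   # arbitrary, just needs to be large

# Each experiment is two fair, independent coin tosses (1 = heads, 0 = tails).
toss1 = rng.integers(0, 2, n_trials)
toss2 = rng.integers(0, 2, n_trials)

p_h1 = np.mean(toss1 == 1)                      # estimate of P(H1)
p_h2 = np.mean(toss2 == 1)                      # estimate of P(H2)
p_h1h2 = np.mean((toss1 == 1) & (toss2 == 1))   # estimate of P(H1 H2)

print(p_h1h2, p_h1 * p_h2)  # both near 0.25: P(H1 H2) = P(H1) P(H2)
```

Contrast this with the single-toss case above, where P(HT) = 0 regardless of P(H) P(T).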
Random Variables

Of interest to us is the distribution of probability along the real number axis. Random variables assign numbers to events or, more precisely, map the event space into a set of numbers:

a ↦ X(a)
event ↦ number

The definition of probability translates directly over to the numbers assigned by random variables. The following properties hold for a real random variable:

1. Let {X ≤ x} = the event that the r.v. X is less than or equal to the number x, defined for all x. [This defines all intervals on the real number line to be events.]

2. The events {X = +∞} and {X = −∞} have zero probability. (Otherwise, moments would generally not be finite.)

Distribution function (CDF = Cumulative Distribution Function):

F_X(x) = P{X ≤ x} ≡ P{all events A : X(A) ≤ x}
Properties:

1. F_X(x) is a monotonically increasing function of x.
2. F_X(−∞) = 0, F_X(+∞) = 1.
3. P{x1 < X ≤ x2} = F_X(x2) − F_X(x1).

Probability Density Function (pdf):

f_X(x) = dF_X(x)/dx

Properties:

1. f_X(x) dx = P{x < X ≤ x + dx}
2. ∫_{−∞}^{+∞} dx f_X(x) = F_X(+∞) − F_X(−∞) = 1 − 0 = 1

Measures such as the mean, median, and mode are all localization measures; other quantities are needed to measure the width and asymmetry of the PDF, etc.
Continuous r.v.'s: the derivative of F_X(x) exists ∀x.

Discrete random variables: use delta functions to write the pdf in pseudo-continuous form, e.g. coin flipping. Let

X = { +1  heads
    { −1  tails

Then

f_X(x) = ½ [δ(x + 1) + δ(x − 1)]

F_X(x) = ½ [U(x + 1) + U(x − 1)]

where U is the unit step function.
Functions of a random variable:

The function Y = g(X) is a random variable that is a mapping from some event A to a number Y according to:

Y(A) = g[X(A)]

Theorem: if Y = g(X), then the pdf of Y is

f_Y(y) = Σ_{j=1}^{n} f_X(x_j) / |dg(x)/dx|_{x=x_j} ,

where x_j, j = 1, …, n are the solutions of y = g(x), i.e. x_j = g^{−1}(y). Note that the normalization property (unit area) is conserved.

This is one of the most important equations!
Example

Y = g(X) = aX + b

dg/dx = a

g^{−1}(y) = x_1 = (y − b)/a

f_Y(y) = f_X(x_1) / |dg(x_1)/dx| = |a|^{−1} f_X((y − b)/a).

Comment about “natural” random number generators.
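The linear-transformation result can be checked by sampling. A sketch assuming a standard Gaussian for f_X (an illustrative choice; the theorem holds for any pdf), with a and b arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Y = aX + b has pdf |a|^-1 f_X((y - b)/a); for a standard Gaussian X
# that is a Gaussian with mean b and standard deviation |a|.
a, b = 2.0, 3.0
x = rng.standard_normal(500_000)
y = a * x + b

mean_y, std_y = y.mean(), y.std()
print(mean_y, std_y)  # near b = 3.0 and |a| = 2.0
```

Note the |a| in the Jacobian factor: for a < 0 the mapping reverses orientation, but the density must stay non-negative.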
To check: show that ∫_{−∞}^{+∞} dy f_Y(y) = 1.

Example: Suppose we want to transform from a uniform distribution to an exponential distribution. We want f_Y(y) = exp(−y) for y ≥ 0. A typical random number generator gives f_X(x) with

f_X(x) = { 1,  0 ≤ x < 1;
         { 0,  otherwise.

Choose y = g(x) = −ln(x). Then:

dg/dx = −1/x

x_1 = g^{−1}(y) = e^{−y}

f_Y(y) = f_X[exp(−y)] / |−1/x_1| = x_1 = e^{−y}.
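This transform is easy to verify empirically. A minimal sketch (sample size arbitrary): the unit-rate exponential has mean 1 and variance 1, so both sample statistics should come out near 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Uniform deviates on [0, 1); y = -ln(x) should then follow
# f_Y(y) = exp(-y), which has mean 1 and variance 1.
x = rng.random(500_000)
y = -np.log(x)

print(y.mean(), y.var())  # both near 1.0
```

This is exactly the inverse-CDF ("inverse transform") sampling trick used by many exponential random number generators.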
Moments

We will always use angular brackets ⟨ ⟩ to denote an average over an ensemble (integrating over an ensemble); time averages and other sample averages will be denoted differently.

Expected value of a random variable:

E(X) ≡ ⟨X⟩ = ∫ dx x f_X(x)

⟨ ⟩ denotes expectation w.r.t. the PDF of X.

Arbitrary power:

⟨X^n⟩ = ∫ dx x^n f_X(x)

Variance:

σ_x² = ⟨X²⟩ − ⟨X⟩²

Function of a random variable: if Y = g(X) and ⟨Y⟩ ≡ ∫ dy y f_Y(y), then it is easy to show that

⟨Y⟩ = ∫ dx g(x) f_X(x).

Proof:

⟨Y⟩ ≡ ∫ dy y f_Y(y) = ∫ dy y Σ_{j=1}^{n} f_X[x_j(y)] / |dg[x_j(y)]/dx|

A change of variable, dy = (dg/dx) dx, yields the result.

Factoid: Poisson events in time have spacings that are exponentially distributed.

Central moments:

μ_n = ⟨(X − ⟨X⟩)^n⟩
Moment Tests:

Moments are useful for testing hypotheses, such as whether a given PDF is consistent with the data. E.g., consistency with a Gaussian PDF:

kurtosis: k = μ_4/μ_2² − 3 = 0

skewness: γ = μ_3/μ_2^{3/2} = 0

k > 0 ⇒ 4th moment proportionately larger ⇒ larger-amplitude tails than a Gaussian and less probable values near the mean.
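A sketch of the moment test in NumPy, using a Gaussian sample and an exponential sample for contrast (the exponential is an illustrative choice; its skewness is 2 and excess kurtosis is 6):

```python
import numpy as np

rng = np.random.default_rng(3)

def skewness(x):
    d = x - x.mean()
    mu2, mu3 = np.mean(d**2), np.mean(d**3)
    return mu3 / mu2**1.5

def excess_kurtosis(x):
    d = x - x.mean()
    mu2, mu4 = np.mean(d**2), np.mean(d**4)
    return mu4 / mu2**2 - 3.0

gauss = rng.standard_normal(200_000)
expo = rng.exponential(size=200_000)   # skewness 2, excess kurtosis 6

sk_g, ku_g = skewness(gauss), excess_kurtosis(gauss)
sk_e, ku_e = skewness(expo), excess_kurtosis(expo)
print(sk_g, ku_g)  # both near 0, consistent with Gaussian
print(sk_e, ku_e)  # near 2 and 6, clearly non-Gaussian
```

In practice one also needs the sampling variance of these estimators to decide whether a nonzero value is significant.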
Uses of Moments:

Often one wants to infer the underlying PDF of an observable, e.g. because determination of the PDF is tantamount to understanding the underlying physics of some process. Two approaches are:

1. Construct a histogram and compare its shape with a theoretical shape.
2. Determine some of the moments (usually low-order) and compare.

Suppose the data are {x_j, j = 1, …, N}.

1. One could form bins of size Δx and count how many x_j fall into each bin. If N is large enough that n_k = the number of points in the k-th bin is also large, then a reasonably good estimate of the PDF can be made. (But beware of the dependence of the results on the choice of binning.)

2. However, often N is too small, or one would like to determine only basic information about the shape of the distribution (is it symmetric?), or to determine the mean and variance of the PDF, or to test whether the data are consistent with a given PDF (hypothesis testing).

Some typical situations are:

i) assume the data were drawn from a Gaussian parent PDF; estimate the mean and σ of the Gaussian [parameter estimation]

ii) test whether the data are consistent with a Gaussian PDF [moment test]

Note that if the r.v. is zero mean, then the PDF is determined solely by one parameter, σ:

f_X(x) = (1/√(2πσ²)) e^{−x²/2σ²}

The moments are

⟨x^n⟩ = 1 · 3 ⋯ (n − 1) σ^n ≡ (n − 1)!! σ^n   for n even
⟨x^n⟩ = 0                                      for n odd

Therefore the n = 2 moment, the first non-zero moment, determines all other moments. This statement carries over to multi-dimensional Gaussian processes:

Any moment of order higher than two is redundant ... or can be used as a test for Gaussianity.
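The (n − 1)!! σ^n formula for even Gaussian moments can be spot-checked by sampling. A sketch with an arbitrary illustrative σ:

```python
import numpy as np
from math import prod

rng = np.random.default_rng(4)
sigma = 1.5                       # illustrative value
x = rng.normal(0.0, sigma, 1_000_000)

def double_factorial(m):
    # m!! = m (m-2) (m-4) ... ; by convention (-1)!! = 1
    return prod(range(m, 0, -2)) if m > 0 else 1

# <x^n> = (n-1)!! sigma^n for even n, and 0 for odd n.
moments = {n: np.mean(x**n) for n in (2, 3, 4)}
theory = {n: (double_factorial(n - 1) * sigma**n if n % 2 == 0 else 0.0)
          for n in (2, 3, 4)}
print(moments, theory)
```

For σ = 1.5 this gives ⟨x²⟩ ≈ σ² = 2.25 and ⟨x⁴⟩ ≈ 3σ⁴ ≈ 15.19, with ⟨x³⟩ consistent with zero.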
Characteristic Function:

Of considerable use is the characteristic function

Φ_X(ω) ≡ ⟨e^{iωx}⟩ = ∫ dx f_X(x) e^{iωx}.

If we know Φ_X(ω), then we know all there is to know about the PDF, because

f_X(x) = (1/2π) ∫ dω Φ_X(ω) e^{−iωx}

is the inversion formula.

If we know all the moments of f_X(x), then we can also completely characterize f_X(x). Similarly, the characteristic function is a moment-generating function:

Φ_X(ω) = ⟨e^{iωX}⟩ = ⟨ Σ_{n=0}^{∞} (iωX)^n / n! ⟩ = Σ_{n=0}^{∞} [(iω)^n / n!] ⟨X^n⟩

because the expectation of a sum = the sum of the expectations.

By taking derivatives we can show that

∂Φ/∂ω |_{ω=0} = i ⟨X⟩

∂²Φ/∂ω² |_{ω=0} = i² ⟨X²⟩

∂^nΦ/∂ω^n |_{ω=0} = i^n ⟨X^n⟩

or

⟨X^n⟩ = i^{−n} ∂^nΦ/∂ω^n |_{ω=0} = (−i)^n ∂^nΦ/∂ω^n |_{ω=0}    (Price's theorem)

Characteristic functions are useful for deriving PDFs of combinations of r.v.'s as well as for deriving particular moments.
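The derivative-to-moment relation can be checked numerically for a case where Φ is known in closed form. A sketch using the zero-mean Gaussian, whose characteristic function is Φ(ω) = exp(−σ²ω²/2), with the derivative taken by finite difference:

```python
import numpy as np

# Characteristic function of a zero-mean Gaussian with std sigma.
sigma = 2.0
phi = lambda w: np.exp(-0.5 * sigma**2 * w**2)

# <X^2> = (-i)^2 * d^2 Phi/d omega^2 at omega = 0, estimated by a
# central finite difference:
h = 1e-4
d2phi = (phi(h) - 2.0 * phi(0.0) + phi(-h)) / h**2
second_moment = -d2phi               # (-i)^2 = -1

print(second_moment)  # near sigma^2 = 4.0
```

The sign flip from (−i)² is exactly why the second derivative of Φ at the origin is minus the second moment.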
Joint Random Variables

Let X and Y be two random variables with their associated sample spaces. The actual events associated with X and Y may or may not be independent (e.g. throwing a die may map into X; choosing colored marbles from a hat may map into Y). The relationship of the events will be described by the joint distribution function of X and Y:

F_XY(x, y) ≡ P{X ≤ x, Y ≤ y}

and the joint probability density function is

f_XY(x, y) ≡ ∂²F_XY(x, y) / ∂x ∂y    (a two-dimensional PDF)

Note that the one-dimensional PDF of X, for example, is obtained by integrating the joint PDF over all y:

f_X(x) = ∫ dy f_XY(x, y)

which corresponds to asking what the PDF of X is given that the certain event for Y occurs.

Example: flip two coins a and b. Let heads = 1, tails = 0. Define 2 r.v.'s: X = a + b, Y = a. With these definitions, X and Y are statistically dependent.

Characteristic function of joint r.v.'s:

Φ_XY(ω_1, ω_2) = ⟨e^{i(ω_1 X + ω_2 Y)}⟩ = ∫∫ dx dy e^{i(ω_1 x + ω_2 y)} f_XY(x, y).

For X, Y independent this factors:

Φ_XY(ω_1, ω_2) = [ ∫ dx f_X(x) e^{iω_1 x} ] [ ∫ dy f_Y(y) e^{iω_2 y} ] ≡ Φ_X(ω_1) Φ_Y(ω_2).

Example for independent r.v.'s: flip two coins a and b. As before, heads = 1 and tails = 0; let X = a, Y = b (X and Y are independent).
Independent random variables

Two random variables are said to be independent if the events mapping into one r.v. are independent of those mapping into the other. In this case, joint probabilities are factorable, so that

F_XY(x, y) = F_X(x) F_Y(y)

f_XY(x, y) = f_X(x) f_Y(y).

Such factorization is plausible if one considers moments of independent r.v.'s:

⟨X^n Y^m⟩ = ⟨X^n⟩ ⟨Y^m⟩

which follows from

⟨X^n Y^m⟩ ≡ ∫∫ dx dy x^n y^m f_XY(x, y) = [ ∫ dx x^n f_X(x) ] [ ∫ dy y^m f_Y(y) ].
Convolution theorem for sums of independent RVs

If Z = X + Y, where X, Y are independent random variables, then the PDF of Z is the convolution of the PDFs of X and Y:

f_Z(z) = f_X ∗ f_Y = ∫ dx f_X(x) f_Y(z − x) = ∫ dx f_X(z − x) f_Y(x).

Proof: By definition,

f_Z(z) = (d/dz) F_Z(z)

Consider

F_Z(z) = P{Z ≤ z}

Now, as before, this is

F_Z(z) = P{X + Y ≤ z} = P{Y ≤ z − X}.

To evaluate this, first evaluate the probability P{Y ≤ z − x}, where x is just a number:

P{Y ≤ z − x} ≡ F_Y(z − x) ≡ ∫_{−∞}^{z−x} dy f_Y(y)

But P{Y ≤ z − X} must account for all values of x, so we integrate over x and weight by the probability of x:

P{Y ≤ z − X} = ∫_{−∞}^{+∞} dx f_X(x) ∫_{−∞}^{z−x} dy f_Y(y)

That is, P{Y ≤ z − X} is the expected value of F_Y(z − X). Differentiating with respect to z using the Leibniz integration formula,

(d/db) ∫_a^{g(b)} dξ h(ξ) = h(g(b)) dg(b)/db,

we obtain the convolution result.
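A quick numerical sketch of the theorem, using two uniform deviates as an illustrative choice: the convolution of two unit boxcars is the triangular pdf on [0, 2], peaking at f_Z(1) = 1, and the variances add to 1/12 + 1/12 = 1/6.

```python
import numpy as np

rng = np.random.default_rng(5)

# Z = X + Y with X and Y independent and uniform on [0, 1).
m = 1_000_000
z = rng.random(m) + rng.random(m)

# Histogram estimate of f_Z; the triangular pdf peaks at z = 1.
counts, edges = np.histogram(z, bins=100, range=(0.0, 2.0), density=True)
peak = counts[49:51].mean()   # density in the two bins straddling z = 1

print(z.mean(), z.var(), peak)  # near 1, 1/6 ≈ 0.167, and 1
```

The triangular shape is the first step of the CLT example below: convolving more boxcars rounds the pdf toward a Gaussian.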
Characteristic function of Z = X + Y

For X, Y independent we have

f_Z = f_X ∗ f_Y  ⇒  Φ_Z(ω) = ⟨e^{iωz}⟩ = Φ_X(ω) Φ_Y(ω)

Variance of Z: if the variances of X and Y are σ_X², σ_Y², then the variance of Z is σ_Z² = σ_X² + σ_Y².

Assume X and Y, and hence Z, are zero-mean r.v.'s. Then we have

σ_X² = ⟨x²⟩ = i^{−2} ∂²Φ_X/∂ω² |_{ω=0} = −∂²Φ_X/∂ω² |_{ω=0}

σ_Y² = ⟨y²⟩ = −∂²Φ_Y/∂ω² |_{ω=0}

Using Price's theorem (the moment relation above):

σ_Z² = ⟨Z²⟩ = −∂²Φ_Z/∂ω² |_{ω=0}

     = −(∂²/∂ω²) [Φ_X(ω) Φ_Y(ω)] |_{ω=0}

     = −(∂/∂ω) [Φ_X ∂Φ_Y/∂ω + Φ_Y ∂Φ_X/∂ω] |_{ω=0}

     = −[Φ_X ∂²Φ_Y/∂ω² + Φ_Y ∂²Φ_X/∂ω² + 2 (∂Φ_X/∂ω)(∂Φ_Y/∂ω)] |_{ω=0}.

At ω = 0 we have Φ_X = Φ_Y = 1, and the first derivatives vanish because the means are zero. We have “discovered” that variances add (independent variables only):

σ_Z² = σ_X² + σ_Y².
Multivariate random variables: N dimensional

The results for the bivariate case are easily extrapolated. If

Z = X_1 + X_2 + … + X_N = Σ_{j=1}^{N} X_j

where the X_j are all independent r.v.'s, then

f_Z(z) = f_{X_1} ∗ f_{X_2} ∗ … ∗ f_{X_N}

and

Φ_Z(ω) = Π_{j=1}^{N} Φ_{X_j}(ω)

and

σ_Z² = Σ_{j=1}^{N} σ_{X_j}².
Central Limit Theorem:

Let

Z_N = (1/√N) Σ_{j=1}^{N} X_j

where the X_j are independent r.v.'s with means and variances

μ_j ≡ ⟨X_j⟩

σ_j² = ⟨X_j²⟩ − ⟨X_j⟩²

and the PDFs of the X_j's are almost arbitrary. The restrictions on the distribution of each X_j are:

i) σ_j² > m > 0, m = constant

ii) ⟨|X|^n⟩ < M = constant for n > 2

In the limit N → ∞, Z_N becomes a Gaussian random variable with mean

⟨Z_N⟩ = (1/√N) Σ_{j=1}^{N} μ_j

and variance

σ_Z² = (1/N) Σ_{j=1}^{N} σ_j².

Example: suppose the X_j are all uniformly distributed between ±½, so f_X(x) = Π(x), the unit boxcar, whose Fourier transform is sin(ω/2)/(ω/2).
Thus the characteristic function of each term is

Φ_j(ω) = ⟨e^{iωx_j}⟩ = sin(ω/2)/(ω/2)

[Figure: [sin(ω/2)/(ω/2)]^N for N = 2, 3, …, compared with the Gaussian limit e^{−ω²}.]

From the convolution results we have

Φ_{√N Z_N}(ω) = [ sin(ω/2) / (ω/2) ]^N

From the transformation of random variables we have that

f_{Z_N}(x) = √N f_{√N Z_N}(√N x)

and by the scaling theorem for Fourier transforms

Φ_{Z_N}(ω) = Φ_{√N Z_N}(ω/√N) = [ sin(ω/(2√N)) / (ω/(2√N)) ]^N.
Now

lim_{N→∞} Φ_{Z_N}(ω) = e^{−σ_Z² ω²/2}

or

f_{Z_N}(x) = (1/√(2π σ_Z²)) e^{−x²/2σ_Z²}.

Consistency with this limiting form can be seen by expanding Φ_{Z_N} for small ω:

Φ_{Z_N}(ω) ≈ [ (ω/(2√N) − (1/3!)(ω/(2√N))³) / (ω/(2√N)) ]^N = [1 − (1/6)(ω/(2√N))²]^N ≈ 1 − ω²/24,

which is identical to the small-ω expansion of exp(−ω²σ_Z²/2) with σ_Z² = 1/12.
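The uniform-sum example can be simulated directly. A sketch (N and the trial count are arbitrary choices): σ_Z² should be 1/12 for every N, and the excess kurtosis should shrink toward the Gaussian value of 0 (roughly as −1.2/N for uniform terms).

```python
import numpy as np

rng = np.random.default_rng(6)

# Z_N = (1/sqrt(N)) * sum of N independent uniforms on [-1/2, 1/2).
N, trials = 30, 200_000
x = rng.random((trials, N)) - 0.5
z = x.sum(axis=1) / np.sqrt(N)

mu2 = np.mean(z**2)
mu4 = np.mean(z**4)
excess_kurt = mu4 / mu2**2 - 3.0   # 0 for a Gaussian; about -1.2/N here

print(z.var(), excess_kurt)  # near 1/12 ≈ 0.0833 and near 0
```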
CLT Comments
• A sum of Gaussian RVs is automatically a Gaussian RV (can show using characteristic functions)
• Convergence to a Gaussian form depends on the actual PDFs of the terms in the sum and their relative variances
• Exceptions exist!
CLT: Example of a PDF that does not work

The Cauchy distribution and its characteristic function are

f_X(x) = (α/π) · 1/(α² + x²),    Φ(ω) = e^{−α|ω|}

Now

Z_N = (1/√N) Σ_{j=1}^{N} x_j

has the characteristic function

Φ_N(ω) = e^{−Nα|ω|/√N} = e^{−√N α|ω|}

By inspection, the exponential will not converge to a Gaussian. Instead, the sum of N Cauchy RVs is itself a Cauchy RV.

Is the Cauchy distribution a legitimate PDF? It integrates to unity, so yes, but it fails the conditions of the CLT: the variance diverges,

⟨X²⟩ = ∫_{−∞}^{+∞} dx x² (α/π) · 1/(α² + x²) → ∞.
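The failure is easy to see in simulation. A sketch comparing the spread of sample means for small and large N (interquartile range is used instead of the standard deviation, which does not exist for Cauchy deviates):

```python
import numpy as np

rng = np.random.default_rng(7)

# The mean of N Cauchy deviates is itself Cauchy with the same scale,
# so the spread of sample means does NOT shrink like 1/sqrt(N).
def iqr_of_means(n, trials=20_000):
    means = rng.standard_cauchy((trials, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

iqr_small = iqr_of_means(10)
iqr_large = iqr_of_means(1000)
print(iqr_small, iqr_large)  # both near 2, the IQR of a unit Cauchy
```

For a CLT-obeying PDF the second number would be ten times smaller than the first.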
A CLT Problem

• Consider a set of N quantities {a_i, i = 1, …, N} that are i.i.d. (independently and identically distributed) with zero mean:

⟨a_i⟩ = 0,    ⟨a_i a_j⟩ = σ_a² δ_ij

• We are interested in the cross correlation between all unique pairs:

C_N = (1/N_X) Σ_{i<j} a_i a_j = (1/N_X) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} a_i a_j,    N_X = N(N−1)/2

• What do you expect ⟨C_N⟩ to be?
• What do you expect the PDF of C_N to be?

Notes:

• The number of independent quantities (random variables) is N.
• The sum C_N has terms that are products of i.i.d. variables.
• Any given term in the sum is statistically independent of some (but not all) of the other terms.
• The PDF of a product is different from the PDF of the individual factors.
• In the limit N ≫ 1 there should be many independent terms in the sum.
• N = 2: one can show that the PDF is symmetric (odd-order moments = 0).
• N > 2: one can show that the third moment ≠ 0. What gives?
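The questions above can be explored numerically. A sketch assuming Gaussian a_i (an illustrative choice; the notes leave the distribution general), using the identity Σ_{i<j} a_i a_j = [(Σ_i a_i)² − Σ_i a_i²]/2:

```python
import numpy as np

rng = np.random.default_rng(8)

# C_N = (1/N_X) * sum_{i<j} a_i a_j with N_X = N(N-1)/2.
def sample_CN(N, trials):
    a = rng.standard_normal((trials, N))   # zero-mean i.i.d. (Gaussian assumed)
    s = a.sum(axis=1)
    pair_sum = (s**2 - (a**2).sum(axis=1)) / 2.0
    return pair_sum / (N * (N - 1) / 2.0)

c = sample_CN(8, 200_000)
mu3 = np.mean((c - c.mean())**3)
print(c.mean(), mu3)  # mean near 0; third central moment small but positive
```

The positive third moment for N > 2 comes from "triangle" index patterns (i,j), (j,k), (k,i), which require three distinct indices and therefore vanish for N = 2.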
Conditional Probabilities & Bayes' Theorem

We have considered P(α), the probability of an event α. Conditional probabilities also obey the axioms of probability: P(β|α) is the probability of the event β given that the event α has occurred,

P(β|α) ≡ P(βα) / P(α)

Recast the axioms as:

I. P(β|α) ≥ 0

II. P(β|α) + P(β̄|α) = 1

III. P(βα|γ) = P(β|γ) P(α|βγ) = P(α|γ) P(β|αγ)

How does this relate to experiments? Use the product rule:

P(α|βγ) = P(α|γ) P(β|αγ) / P(β|γ)

or, letting M = model (or hypothesis), D = data, and I = background information (assumptions),

P(M|DI) = P(M|I) P(D|MI) / P(D|I)

Terms:

prior: P(M|I)

sampling distribution for D: P(D|MI) (also called the likelihood for M)

prior predictive for D: P(D|I) (also called the global likelihood for M, or the evidence for M)
Particular strengths of the Bayesian method include:

1. One must often be explicit about what is assumed in I, the background information.

2. In assessing models, we get a PDF for the parameters rather than just point estimates.

3. Occam's razor (simpler models win, all else being equal) is easily invoked when comparing models. We may have many different models M_i that we wish to compare. Form the odds ratio from the posterior PDFs P(M_i|DI):

O_{i,j} ≡ P(M_i|DI) / P(M_j|DI) = [ P(M_i|I) / P(M_j|I) ] · [ P(D|M_i I) / P(D|M_j I) ].
Example

Data: {k_i}, i = 1, …, n, drawn from a Poisson process.

Poisson PDF: P_k = λ^k e^{−λ} / k!

Want: the mean of the process.

Frequentist approach:

We need an estimator for the mean; consider the likelihood

f(λ) = Π_{i=1}^{n} P(k_i) = (1 / Π_{i=1}^{n} k_i!) λ^{Σ_i k_i} e^{−nλ}.

Maximizing,

df/dλ = 0 = f(λ) [ −n + λ^{−1} Σ_{i=1}^{n} k_i ],

we obtain as an estimator for the mean

k̄ = (1/n) Σ_{i=1}^{n} k_i.
Bayesian approach:

Likelihood (as before):

P(D|MI) = Π_{i=1}^{n} P(k_i) = (1 / Π_{i=1}^{n} k_i!) λ^{Σ_i k_i} e^{−nλ}.

Prior:

P(M|I) = P(λ|I). Assume a flat prior, P(λ|I) ∝ U(λ), where U is the unit step function (λ ≥ 0).

Prior predictive:

P(D|I) ∝ ∫_{−∞}^{+∞} dλ U(λ) P(D|MI) = Γ(nk̄ + 1) / (n^{nk̄+1} Π_{i=1}^{n} k_i!),

where k̄ = (1/n) Σ_i k_i. Combining all of the above, we find

P(λ|{k_i} I) = [ n^{nk̄+1} / Γ(nk̄ + 1) ] λ^{nk̄} e^{−nλ} U(λ),

which is a Gamma distribution in λ. Note that rather than getting a point estimate for the mean, we get a PDF for its value. For hypothesis testing, this is much more useful than a point estimate.
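The posterior can be checked numerically. A sketch in which lam_true = 4.2 and n = 50 are illustrative choices (not values from the notes): the Gamma posterior with shape nk̄ + 1 and rate n should integrate to unity and have mean k̄ + 1/n, close to but not identical to the frequentist estimator k̄.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(9)

# Simulated Poisson data; lam_true and n are illustrative assumptions.
lam_true, n = 4.2, 50
k = rng.poisson(lam_true, n)
kbar = k.mean()

# Flat-prior posterior: a Gamma density with shape a = n*kbar + 1, rate n:
#   p(lam | {k_i}) = n^a / Gamma(a) * lam^(a-1) * exp(-n*lam)
a = n * kbar + 1.0
lam = np.linspace(1e-6, 12.0, 200_001)
log_p = a * np.log(n) - lgamma(a) + (a - 1.0) * np.log(lam) - n * lam
p = np.exp(log_p)                      # log form avoids overflow

dlam = lam[1] - lam[0]
norm = p.sum() * dlam                  # should be ~1 (proper normalization)
post_mean = (lam * p).sum() * dlam     # should be kbar + 1/n

print(norm, post_mean, kbar)
```

With the full posterior in hand, credible intervals for λ follow by integrating p, something a point estimate alone cannot provide.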