Error analysis lecture 7


Physics 509: Dealing with Gaussian Distributions

Scott Oser, Lecture #7

September 25, 2008


    Outline

Last time: we finished our initial tour of Bayesian analysis, having studied the question of prior assignment in detail, particularly maximum entropy methods.

Today we return to the Gaussian distribution and highlight various important aspects:

1) The Central Limit Theorem
2) The Chebyshev inequality
3) Mathematics of multi-dimensional Gaussians
4) Gaussian error ellipses
5) Marginalizing Gaussian distributions
6) Application: the Rayleigh power test


    The Central Limit Theorem

If X is the sum of N independent random variables $x_i$, each taken from a distribution with mean $\mu_i$ and variance $\sigma_i^2$, then the distribution for X approaches a Gaussian distribution in the limit of large N. The mean and variance of this Gaussian are given by:

$$\langle X \rangle = \sum_i \mu_i, \qquad V_X = \sum_i V_i = \sum_i \sigma_i^2$$
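
To make the theorem concrete, here is a minimal Monte Carlo sketch in Python (my own illustration; the three component distributions are arbitrary choices) confirming that the means and variances add even when the components are not identically distributed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 200_000

# Three non-identical independent components.
x1 = rng.uniform(0.0, 2.0, n_trials)                 # mean 1.0, variance 1/3
x2 = rng.exponential(scale=1.5, size=n_trials)       # mean 1.5, variance 2.25
x3 = rng.normal(loc=-0.5, scale=2.0, size=n_trials)  # mean -0.5, variance 4.0

X = x1 + x2 + x3
print("mean of X:", X.mean(), " expected:", 1.0 + 1.5 - 0.5)
print("var of X: ", X.var(), " expected:", 1/3 + 2.25 + 4.0)
```

A histogram of X already looks roughly Gaussian even with only three terms in the sum.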


    The Central Limit Theorem: the caveats

I said N independent variables! Obviously the variables must individually have finite variances. I've said nothing about how fast the distribution approaches a Gaussian as N goes to infinity. It is not necessary for all N variables to be identically distributed, provided that the following conditions apply (the Lyapunov conditions):

A. $r_n^3 = \sum_{i=1}^{n} E\left[\,\lvert X_i - \mu_i \rvert^3\,\right]$ must be finite for every n (this is the sum of the third absolute central moments)

B. $\lim_{n\to\infty} \dfrac{r_n}{s_n} = 0$, where $s_n^2 = \sum_{i=1}^{n} \sigma_i^2$


    The Central Limit Theorem: the benefits

The beauty of the central limit theorem is that many very complicated factors all tend to come out in the wash. For example, normal distributions are used to model:

- the heights of adults in a population
- the logarithm of the heights/weights/lengths of members of a biological population
- the intensity of laser light
- changes in the logarithm of exchange rates, price indices, and stock market indices

Note that sometimes we assume the logarithm of a quantity follows a normal distribution rather than the quantity itself. The reason is that if the various factors are all multiplicative rather than additive, then the central limit theorem doesn't apply to their sum, but it does apply to the sum of their logarithms. The result is called a log-normal distribution.

The normal distribution is more likely to describe the meat of a distribution than its tails.
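
As a small illustration of this point (a sketch of my own, with made-up factor distributions): a product of many independent positive factors is noticeably skewed, while its logarithm is close to Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_factors = 100_000, 50

# Product of many independent positive factors (e.g. multiplicative growth).
factors = rng.uniform(0.9, 1.1, size=(n_samples, n_factors))
product = factors.prod(axis=1)
logs = np.log(product)  # a sum of 50 independent terms -> CLT applies

def skew(a):
    """Sample skewness: zero for a symmetric (e.g. Gaussian) distribution."""
    return ((a - a.mean()) ** 3).mean() / a.std() ** 3

print("skewness of the product:     ", skew(product))
print("skewness of log(the product):", skew(logs))
```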


    CLT: How much is enough?

How many independent variables do you need to add in order to get a very good normal distribution? Difficult question---it depends on what the component distributions look like.

Best solution: simulate it, as in the sketch after this list.

Possibly useful convergence theorems for identically distributed variables:

- Convergence is monotonic with N---as N increases, the entropy of the distribution monotonically increases to approach a normal distribution's entropy (remember maximum entropy principles).
- If the third central moment is finite, then the speed of convergence (as measured by the difference between the true cumulative distribution and the normal cumulative distribution at a fixed point) is at least as fast as 1/sqrt(N).
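
Here is the sort of simulation meant above (a sketch; exponential components are an arbitrary test case). It measures the gap between the standardized sum's empirical CDF and the normal CDF at a fixed point, for increasing N:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n_trials = 100_000
point = 0.5  # fixed evaluation point, in standardized units

for N in (2, 8, 32, 128):
    # Sum of N i.i.d. exponential(1) variables has mean N and variance N.
    sums = rng.exponential(1.0, size=(n_trials, N)).sum(axis=1)
    z = (sums - N) / np.sqrt(N)  # standardize
    gap = abs((z <= point).mean() - norm.cdf(point))
    print(f"N={N:4d}  |F_N - Phi| at z=0.5: {gap:.4f}   1/sqrt(N) = {N**-0.5:.4f}")
```

The gap shrinks with N roughly in line with the 1/sqrt(N) scaling quoted above.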


    CLT: Heliostat Example

[Figure: a heliostat, a multi-faceted mirror that tracks the Sun. Overlapping Sun images, each a uniform disk.]


    The Chebyshev inequality

Suppose that you make a measurement, and get a result that is 3 standard deviations from the mean. What is the probability of that happening?

If you know the PDF (Gaussian or otherwise), of course you just calculate it.

But what if you don't know the form of the PDF?

The Chebyshev inequality gives you a way to put an upper limit on the probability of seeing an outlier---even when you know almost nothing about the distribution.


    Statement of the Chebyshev inequality

Let X be a random variable for which Var(X) exists. Then for any given number t > 0 the following is true:

$$P\big(\lvert X - E(X) \rvert \ge t\big) \le \frac{\mathrm{Var}(X)}{t^2}$$

Equivalently, in units of the standard deviation $\sigma$:

$$P\big(\lvert X - E(X) \rvert \ge t\sigma\big) \le \frac{1}{t^2}$$

For example, the probability of getting a measurement that differs from the true mean by more than $3\sigma$ is at most 1/9 (about 11%), no matter what the distribution is.
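
A quick numeric check (a sketch; the exponential distribution is an arbitrary, deliberately non-Gaussian choice). The observed tail fractions always respect the bound, though the bound is usually far from tight:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(1.0, size=1_000_000)  # skewed, decidedly non-Gaussian
mu, sigma = x.mean(), x.std()

for k in (2, 3, 4):
    observed = (np.abs(x - mu) >= k * sigma).mean()
    print(f"k={k}: observed {observed:.4f} <= Chebyshev bound {1 / k**2:.4f}")
```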


    Review of covariances of joint PDFs

Consider some multidimensional PDF $p(x_1, \ldots, x_n)$. We define the covariance between any two variables by:

$$\mathrm{cov}(x_i, x_j) = \int d\vec{x}\; p(\vec{x})\, (x_i - \langle x_i \rangle)(x_j - \langle x_j \rangle)$$

The set of all possible covariances defines a covariance matrix, often denoted by $V_{ij}$. The diagonal elements of $V_{ij}$ are the variances of the individual variables, while the off-diagonal elements are related to the correlation coefficients:

$$V_{ij} = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \rho_{21}\sigma_2\sigma_1 & \sigma_2^2 & \cdots & \rho_{2n}\sigma_2\sigma_n \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1}\sigma_n\sigma_1 & \rho_{n2}\sigma_n\sigma_2 & \cdots & \sigma_n^2 \end{pmatrix}$$
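
In practice the covariance matrix of a sample is a one-liner; a short sketch (the correlated sample itself is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Build a correlated pair: y depends partly on x.
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)
data = np.vstack([x, y])

V = np.cov(data)         # covariance matrix V_ij
R = np.corrcoef(data)    # matrix of correlation coefficients rho_ij
print("V =\n", V)
print("rho_xy =", R[0, 1], " check:", V[0, 1] / np.sqrt(V[0, 0] * V[1, 1]))
```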


Approximating the peak of a PDF with a multidimensional Gaussian

Suppose we have some complicated-looking PDF in 2D that has a well-defined peak. How might we approximate the shape of this PDF around its maximum?


    Taylor Series expansion

Consider a Taylor series expansion of the logarithm of the PDF around its maximum at $(x_0, y_0)$:

$$\log P(x,y) = P_0 + A(x-x_0) + B(y-y_0) - \frac{1}{2}\left[ C(x-x_0)^2 + D(y-y_0)^2 + 2E(x-x_0)(y-y_0) \right] + \ldots$$

Since we are expanding around the peak, the first derivatives must equal zero, so A = B = 0. The remaining terms can be written in matrix form:

$$\log P(x,y) \approx P_0 - \frac{1}{2}\begin{pmatrix} x-x_0 & y-y_0 \end{pmatrix}\begin{pmatrix} C & E \\ E & D \end{pmatrix}\begin{pmatrix} x-x_0 \\ y-y_0 \end{pmatrix}$$

In order for $(x_0, y_0)$ to be a maximum of the PDF (and not a minimum or saddle point), the above matrix must be positive definite, and therefore invertible.


    Taylor Series expansion

Let me now suggestively denote the inverse of the above matrix by $V_{ij}$. It's a positive definite matrix with three parameters. In fact, I might as well call these parameters $\sigma_x$, $\sigma_y$, and $\rho$.

Exponentiating, we see that around its peak the PDF can be approximated by a multidimensional Gaussian. The full formula, including normalization, is

$$P(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-x_0}{\sigma_x}\right)^2 + \left(\frac{y-y_0}{\sigma_y}\right)^2 - 2\rho \left(\frac{x-x_0}{\sigma_x}\right)\left(\frac{y-y_0}{\sigma_y}\right) \right] \right\}$$

This is a good approximation as long as higher-order terms in the Taylor series are small.
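
Numerically, reading off $\sigma_x$, $\sigma_y$, and $\rho$ from the curvature of $\log P$ at the peak is just a matrix inversion. A sketch with made-up curvature coefficients C, D, E (using the sign convention above, where the quadratic form of $-\log P$ is $\frac{1}{2}\Delta^T M \Delta$):

```python
import numpy as np

# Made-up second-derivative coefficients of -log P at the peak.
C, D, E = 2.0, 3.0, 1.0
M = np.array([[C, E],
              [E, D]])

V = np.linalg.inv(M)  # covariance matrix of the Gaussian approximation
sigma_x, sigma_y = np.sqrt(np.diag(V))
rho = V[0, 1] / (sigma_x * sigma_y)
print(f"sigma_x = {sigma_x:.3f}, sigma_y = {sigma_y:.3f}, rho = {rho:.3f}")
```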


    Interpretation of multidimensional Gaussian

$$P(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-x_0}{\sigma_x}\right)^2 + \left(\frac{y-y_0}{\sigma_y}\right)^2 - 2\rho \left(\frac{x-x_0}{\sigma_x}\right)\left(\frac{y-y_0}{\sigma_y}\right) \right] \right\}$$

Can I directly relate the free parameters to the covariance matrix? First calculate P(x) by marginalizing over y:

$$P(x) \propto \exp\left\{ -\frac{1}{2(1-\rho^2)} \left(\frac{x-x_0}{\sigma_x}\right)^2 \right\} \int dy\, \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{y-y_0}{\sigma_y}\right)^2 - 2\rho \left(\frac{x-x_0}{\sigma_x}\right)\left(\frac{y-y_0}{\sigma_y}\right) \right] \right\}$$

Complete the square in y by adding and subtracting $\rho^2 \left(\frac{x-x_0}{\sigma_x}\right)^2$ inside the brackets:

$$P(x) \propto \exp\left\{ -\frac{1}{2(1-\rho^2)} \left(\frac{x-x_0}{\sigma_x}\right)^2 \right\} \int dy\, \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{y-y_0}{\sigma_y} - \rho\,\frac{x-x_0}{\sigma_x}\right)^2 - \rho^2\left(\frac{x-x_0}{\sigma_x}\right)^2 \right] \right\}$$

The integral over y is now just a Gaussian integral whose value is independent of x, so

$$P(x) \propto \exp\left\{ -\frac{1}{2(1-\rho^2)} \left(\frac{x-x_0}{\sigma_x}\right)^2 \right\} \exp\left\{ +\frac{\rho^2}{2(1-\rho^2)} \left(\frac{x-x_0}{\sigma_x}\right)^2 \right\} = \exp\left\{ -\frac{1}{2} \left(\frac{x-x_0}{\sigma_x}\right)^2 \right\}$$

So we get a Gaussian with width $\sigma_x$. The calculation for $\sigma_y$ is similar, and one can also show that $\rho$ is the correlation coefficient.
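
The same result can be checked numerically; a brief sketch that integrates the 2D Gaussian over y on a grid and compares against a 1D Gaussian of width $\sigma_x$ (parameter values chosen arbitrarily):

```python
import numpy as np

sigma_x, sigma_y, rho = 2.0, 1.0, 0.8
x0 = y0 = 0.0

def P(x, y):
    """Normalized bivariate Gaussian PDF."""
    u, v = (x - x0) / sigma_x, (y - y0) / sigma_y
    norm = 2 * np.pi * sigma_x * sigma_y * np.sqrt(1 - rho**2)
    return np.exp(-(u**2 + v**2 - 2 * rho * u * v) / (2 * (1 - rho**2))) / norm

y = np.linspace(-12, 12, 4001)
dy = y[1] - y[0]
for x in (0.0, 1.0, 3.0):
    numeric = (P(x, y) * dy).sum()  # marginalize over y on the grid
    analytic = np.exp(-(x - x0)**2 / (2 * sigma_x**2)) / (np.sqrt(2 * np.pi) * sigma_x)
    print(f"x = {x}: numeric {numeric:.6f}  analytic {analytic:.6f}")
```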


    P(x|y)

$$P(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x-x_0}{\sigma_x}\right)^2 + \left(\frac{y-y_0}{\sigma_y}\right)^2 - 2\rho \left(\frac{x-x_0}{\sigma_x}\right)\left(\frac{y-y_0}{\sigma_y}\right) \right] \right\}$$

Note: if you view y as a fixed parameter, then the PDF P(x|y) is a Gaussian with a width of

$$\sigma_x \sqrt{1-\rho^2}$$

and a mean value of

$$x_0 + \rho\,\frac{\sigma_x}{\sigma_y}\,(y - y_0)$$

It makes sense that the width of P(x|y) is always narrower than the width of the marginalized PDF P(x) (integrated over y). If you know the actual value of y, you have additional information and so a tighter constraint on x.


[Figure: error ellipses for a 2D Gaussian with $\sigma_x = 2$, $\sigma_y = 1$, $\rho = 0.8$. Red ellipse: contour with the argument of the exponential set to equal -1/2. Blue ellipse: contour containing 68% of the 2D probability content.]


    Contour ellipses

The contour ellipses are defined by setting the argument of the exponential equal to a constant. The exponential's argument equals -1/2 on the red ellipse from the previous figure. The parameters of this ellipse are:

$$\tan 2\theta = \frac{2\rho \sigma_x \sigma_y}{\sigma_x^2 - \sigma_y^2}$$

$$u^2 = \frac{\sigma_x^2 \cos^2\theta - \sigma_y^2 \sin^2\theta}{\cos^2\theta - \sin^2\theta}, \qquad v^2 = \frac{\sigma_y^2 \cos^2\theta - \sigma_x^2 \sin^2\theta}{\cos^2\theta - \sin^2\theta}$$

where $\theta$ is the rotation angle of the ellipse axes and u and v are the semi-axis lengths along the rotated directions.
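
These formulas are easy to check against a direct eigen-decomposition of the covariance matrix; a sketch using the example values from the figure:

```python
import numpy as np

sigma_x, sigma_y, rho = 2.0, 1.0, 0.8

# Rotation angle of the ellipse axes.
theta = 0.5 * np.arctan2(2 * rho * sigma_x * sigma_y, sigma_x**2 - sigma_y**2)
c2, s2 = np.cos(theta)**2, np.sin(theta)**2
u2 = (sigma_x**2 * c2 - sigma_y**2 * s2) / (c2 - s2)
v2 = (sigma_y**2 * c2 - sigma_x**2 * s2) / (c2 - s2)

# Cross-check: u^2 and v^2 should be the eigenvalues of the covariance matrix.
V = np.array([[sigma_x**2,              rho * sigma_x * sigma_y],
              [rho * sigma_x * sigma_y, sigma_y**2]])
print("theta =", np.degrees(theta), "degrees")
print("u^2, v^2 =", u2, v2)
print("eigenvalues:", np.linalg.eigvalsh(V))
```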


    Probability content inside a contour ellipse

For a 1D Gaussian $\exp(-x^2/2\sigma^2)$, the $\pm 1\sigma$ limits occur when the argument of the exponent equals -1/2, and there's a 68% chance of the measurement falling within $\pm 1\sigma$ of the mean.

But for a 2D Gaussian this is not the case. Easiest to see this for the simple case of $\sigma_x = \sigma_y = 1$, $\rho = 0$:

$$\frac{1}{2\pi} \int dx\, dy\, \exp\left[-\frac{1}{2}\left(x^2+y^2\right)\right] = \int_0^{r_0} dr\, r\, \exp\left[-\frac{r^2}{2}\right] = 0.68$$

Evaluating this integral and solving gives $r_0^2 = 2.3$. So 68% of the probability content is contained within a radius of $\sigma\sqrt{2.3}$.

We call this the 2D 1$\sigma$ contour. Note that it's bigger than the 1D version---if you pick points inside the 2D 68% contour and plot their x coordinates, they'll span a wider range than those picked from the 68% contour of the 1D marginalized PDF!


[Figure: the same error ellipses as above ($\sigma_x = 2$, $\sigma_y = 1$, $\rho = 0.8$), repeated. Red ellipse: contour with the argument of the exponential set to equal -1/2. Blue ellipse: contour containing 68% of the probability content.]


Marginalization by minimization

Normal marginalization procedure: integrate over y.

For a multidimensional Gaussian, this gives the same answer as finding the extrema of the ellipse---for every x, find the value of y that maximizes the likelihood.

For example, at x=2 the value of y which maximizes the likelihood is just where the dashed line touches the ellipse. The value of the likelihood at that point is then the value of P(x).


Two marginalization procedures

Normal marginalization procedure: integrate over the nuisance variables:

$$P(x) = \int dy\, P(x,y)$$

Alternate marginalization procedure: maximize the likelihood as a function of the nuisance variables, and return the result:

$$P(x) \propto \max_y P(x,y)$$

(It is not necessarily the case that the resulting PDF is normalized.)

I can prove for Gaussian distributions that these two marginalization procedures are equivalent, but this does not hold for the general case (in fact they generally give different results).

Bayesians always follow the first prescription. Frequentists most often use the second.

Sometimes it will be computationally easier to apply one, sometimes the other, even for PDFs that are approximately Gaussian. A numerical comparison is sketched below.
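
Here is a minimal numerical comparison for the Gaussian case (a sketch with arbitrary parameters). Both procedures give a function of x proportional to the same Gaussian, so their ratio is constant in x:

```python
import numpy as np

sigma_x, sigma_y, rho = 2.0, 1.0, 0.8

def P(x, y):
    """Unnormalized bivariate Gaussian."""
    u, v = x / sigma_x, y / sigma_y
    return np.exp(-(u**2 + v**2 - 2 * rho * u * v) / (2 * (1 - rho**2)))

y = np.linspace(-12, 12, 8001)
dy = y[1] - y[0]
for x in (0.5, 1.0, 2.0):
    integrated = (P(x, y) * dy).sum()  # marginalize by integrating over y
    profiled = P(x, y).max()           # marginalize by maximizing over y
    print(f"x = {x}: integrated/profiled = {integrated / profiled:.6f}")
```

The printed ratio is the same for every x, so after normalization the two procedures coincide; for a non-Gaussian PDF this ratio would generally depend on x.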



Example of the CLT: Do werewolf sightings vary with the phase of the moon?


    Rayleigh power periodicity test

Imagine doing a random walk in 2D. If all directions (phases) are equally likely, there is no net displacement on average. If some phases are more likely than others, on average you get a net displacement.

This motivates the Rayleigh power statistic:

$$S = \left( \sum_{i=1}^{N} \sin \omega t_i \right)^2 + \left( \sum_{i=1}^{N} \cos \omega t_i \right)^2$$

This is really just the squared length of the displacement vector from the origin. For the werewolf data, S = 8167.
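
Computing S from a list of event times is only a few lines. A sketch with simulated times (drawn, for illustration, from the modulated rate revealed at the end of the lecture; the resulting S will not reproduce the 8167 of the real data set):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 885
period = 28.0                  # days; the period being tested
omega = 2 * np.pi / period

# Simulated sighting times from P(t) ~ 0.9 + 0.1 sin(omega t),
# via rejection sampling over many periods.
t = rng.uniform(0, 100 * period, size=20 * N)
accept = rng.uniform(size=t.size) < (0.9 + 0.1 * np.sin(omega * t))
t = t[accept][:N]

S = np.sin(omega * t).sum()**2 + np.cos(omega * t).sum()**2
print("Rayleigh power S =", S)
```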


    Null hypothesis expectation for Rayleigh power

So S = 8167. Is that likely or not? What do we expect to get?

If there is no real time variation, all phases are equally likely. It's like a random walk in 2D. By the Central Limit Theorem, the total displacement in x or in y should be Gaussian with mean 0 and variance $N\sigma^2$, where

$$\sigma^2 = \frac{1}{2\pi}\int_0^{2\pi} \cos^2\theta\, d\theta = \frac{1}{2\pi}\int_0^{2\pi} \sin^2\theta\, d\theta = \frac{1}{2}$$

Since the average displacements in x and y are uncorrelated (you can calculate the covariance to show this), the joint PDF must be

$$P(x,y) = \frac{1}{\pi N} \exp\left[ -\frac{1}{N}\left(x^2+y^2\right) \right]$$

We can do a change of variables to get this as a 1D PDF in $s = r^2$ (marginalizing over the angle):

$$P(s) = \frac{1}{N}\, e^{-s/N}$$
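
A quick Monte Carlo check of this null distribution (a sketch; the number of pseudo-experiments is arbitrary): S should be exponentially distributed with mean N:

```python
import numpy as np

rng = np.random.default_rng(7)
N, n_experiments = 885, 20_000

# Under the null hypothesis all phases are uniform on [0, 2*pi).
phases = rng.uniform(0, 2 * np.pi, size=(n_experiments, N))
S = np.sin(phases).sum(axis=1)**2 + np.cos(phases).sum(axis=1)**2

print("mean of S:", S.mean(), " expected:", N)
print("P(S > 3N):", (S > 3 * N).mean(), " expected:", np.exp(-3.0))
```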


    So, do werewolves come out with the full moon?

The data set had S = 8167 for N = 885. How likely is that?

Assuming werewolf sightings occur randomly in time, the probability of getting s > 8167 is:

$$P(s > 8167) = \int_{8167}^{\infty} \frac{1}{885}\, e^{-s/885}\, ds = \exp\left[-8167/885\right] \approx 10^{-4}$$

Because this is a very small probability, we conclude that werewolves DO come out with the full moon.

(Actually, we should only conclude that their appearances vary with a period of 28 days---maybe they only come out during the new moon ...)

The data was actually drawn from a distribution:

$$P(t) \propto 0.9 + 0.1 \sin \omega t$$


    How powerful is the Rayleigh power test?

Rayleigh power probability: 0.0001

Compare to a $\chi^2$ test on the folded phase diagram: $\chi^2$ = 48.4 for 27 d.o.f., giving P = 0.0025.


    To save time, let's get started on Lecture 8.