Error analysis lecture 6



    Physics 509 1

    Physics 509: Bayesian Priors

    Scott Oser Lecture #6

    September 23, 2008


    Physics 509 2

Outline

Last time: we worked a few examples of Bayesian analysis, and saw that Bayes' theorem provides a mathematical justification for the principle known as Ockham's razor.

    Today:

1) Dependence of Bayesian analysis on prior parameterization
2) Advice on how to choose the right prior
3) Objective priors---quantifying our ignorance
4) Maximum entropy priors
5) What to do when you don't know how your data is distributed?


    Physics 509 3

    Bayes' Theorem

P(H|D,I) = \frac{P(H|I) \, P(D|H,I)}{P(D|I)}

Today we want to examine the science and/or art of how you should choose a prior for a Bayesian analysis.

If this were always easy, everyone would probably be Bayesian.


    Physics 509 4

Prior from a prior analysis

The best solution to any problem is to let someone else solve it for you.

If there exist prior measurements of the quantities you need to estimate, why not use them as your prior? (Duh!)

Be careful, of course---if you have reason to believe that the previous measurement is actually a mistake (not just a statistical fluctuation) you wouldn't want to include it.

Even the most complicated statistical analysis does not eliminate the need to apply good scientific judgement and common sense.


    Physics 509 5

Dependence on parameterization

Two theorists set out to predict the mass of a new particle.

Carla (writes down the theory):

There should be a new particle whose mass is greater than 0 but less than 1, in appropriate units. I have absolutely no other knowledge about the mass, so I'll assume it has equal chances of having any value between zero and 1---i.e. P(m) = 1.

Heidi (writes down the exact same theory):

There is a new particle described by a single free parameter y = m^2 in the Klein-Gordon equation. I'm sure that the true value of y must lie between 0 and 1. Since y is the quantity that appears in my theory, and I know nothing else about it, I'll assume a uniform prior on y---i.e. P(y) = 1.

These are two valid statements of ignorance about the same theory, but with different parameterizations.


    Physics 509 6

An experiment reports: m = 0.3 ± 0.1

The experimental apparatus naturally measures m, so the experiment reports that (rather than y). Our two theorists incorporate this new knowledge into their theory. Carla calculates a new probability distribution P(m|D,I) for m. Heidi converts the measurement into a statement about the quantity y, and calculates P(y|D,I). They then get together to compare results. Heidi does a change of variables on her PDF so she can directly compare to Carla's result.


    Physics 509 7

The sad truth: choice of parameterization matters

It's quantitatively different to say that all values of m are equally likely versus all values of m^2 are equally likely. The latter will favour larger values of m (if it's 50/50 that m^2 is larger than 0.5, then it's 50/50 that m is larger than 0.707).

Which is right? Statistics alone cannot decide. Only you can, based on physical insight, theoretical biases, etc.

    If in doubt, try it both ways.
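
To make the comparison concrete, here is the change of variables behind that statement (a worked step of my own, using y = m^2 from the previous slides): if P(y|I) = 1 on 0 < y < 1, then

P(m|I) = P(y|I) \left| \frac{dy}{dm} \right| = 2m, \qquad 0 < m < 1,

so Prob(m > 1/\sqrt{2}) = \int_{1/\sqrt{2}}^{1} 2m \, dm = 1/2, whereas Carla's flat prior gives Prob(m > 1/\sqrt{2}) = 1 - 1/\sqrt{2} \approx 0.29.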


    Physics 509 8

Principle of Ignorance

In the absence of any reason to distinguish one outcome from another, assign them equal probabilities.

Example: you roll a 6-sided die. You have no reason to believe that the die is loaded. It's intuitive that you should assume that all 6 outcomes are equally likely (p = 1/6) until you discover a reason to think otherwise.

Example: a primordial black hole passing through our galaxy hits Earth. We have no reason to believe it's more likely to come from one direction than any other. So we assume that the impact point is uniformly distributed over the Earth's surface.

    Parameterization note: this is not the same as assuming that all latitudes are equally likely!
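
As a quick check of that note (a worked step of my own, not from the original slide): for an impact point uniform over the sphere, the probability of hitting latitude \lambda is weighted by the circumference of that latitude circle,

P(\lambda) \, d\lambda \propto \cos\lambda \, d\lambda, \qquad -\pi/2 \le \lambda \le \pi/2,

so equatorial latitudes are more probable than polar ones, even though every patch of surface is equally likely.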


    Physics 509 9

Uniform Prior

Suppose an unknown parameter refers to the location of something (e.g. a peak in a histogram). All positions seem equally likely.

Imagine shifting everything by X' = X + c. We demand that P(X|I) dX = P(X'|I) dX' = P(X'|I) dX. This is only true for all c if P(X|I) is a constant.

Really obvious, perhaps ... if you are completely ignorant about the location of something, use a uniform prior for your initial guess of that location.

Note: although a properly normalized uniform prior has a finite range, you can often get away with using an (improper) uniform prior from -∞ to +∞ as long as the product of the prior and the likelihood is normalizable (i.e. its integral is finite).


    Physics 509 10

Jeffreys Prior

Suppose an unknown parameter measures the size of something, and that we have no good idea how big the thing will be (1 mm? 1 m? 1 km?). We are ignorant about the scale. Put another way, our prior should have the same form no matter what units we use to measure the parameter with. If T' = \beta T, then

P(T|I) \, dT = P(T'|I) \, dT' = \beta P(T'|I) \, dT

P(T|I) = \beta P(\beta T|I), which is only true for all \beta if

P(T|I) = \frac{\text{constant}}{T}

Properly normalized from T_{min} to T_{max} this is:

P(T|I) = \frac{1}{T \ln(T_{max}/T_{min})}
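
A minimal numerical sketch of what this prior looks like (my own illustration, not from the slides; the helper name sample_jeffreys is mine): drawing from a Jeffreys prior between T_min and T_max is the same as sampling uniformly in ln T.

```python
import numpy as np

def sample_jeffreys(t_min, t_max, size, rng=None):
    """Draw samples from the Jeffreys prior P(T) = 1 / (T ln(t_max/t_min)).

    Equivalent to sampling uniformly in ln T: T = t_min * (t_max/t_min)**u, u ~ U(0,1).
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=size)
    return t_min * (t_max / t_min) ** u

# Scale invariance in action: each decade receives the same share of samples.
samples = sample_jeffreys(1.0, 1e3, size=100_000)
for lo, hi in [(1, 10), (10, 100), (100, 1000)]:
    frac = np.mean((samples >= lo) & (samples < hi))
    print(f"fraction in [{lo}, {hi}): {frac:.3f}")
```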


    Physics 509 11

Modified Jeffreys Prior

What if your parameter could equal zero? The Jeffreys prior is not normalizable for T_min = 0---the 1/T form blows up at the origin, piling essentially all of the probability arbitrarily close to T = 0.
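
The slide text is truncated at this point. For completeness, a commonly used modified form (the standard textbook choice; an assumption on my part, not a quote from the original slide) introduces a constant break point a:

P(T|I) = \frac{1}{(T + a) \, \ln\!\left( \frac{a + T_{max}}{a} \right)}

This behaves like a uniform prior for T \ll a and like a Jeffreys prior for T \gg a, and it remains normalizable even when T can equal zero.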


    Physics 509 12

    Given enough data, priors don't matter

The more constraining your data becomes, the less the prior matters.

When your posterior distribution is much narrower than the prior, the prior won't vary much over the region of interest. Most priors are approximately flat in this case.

Consider the case of estimating p for a binomial distribution after observing 20 or 100 coin flips.
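
A minimal sketch of this point (my own illustration, not from the original slides; the 60% heads fraction and the peaked prior's parameters are invented for the example): compute the posterior for the binomial parameter p under a flat prior and under a deliberately opinionated prior, for 20 and then 100 flips.

```python
import numpy as np

p_grid = np.linspace(1e-6, 1 - 1e-6, 2001)
dp = p_grid[1] - p_grid[0]

def posterior(n_heads, n_flips, prior):
    """Grid-based posterior for the binomial parameter p, normalized on p_grid."""
    log_like = n_heads * np.log(p_grid) + (n_flips - n_heads) * np.log(1 - p_grid)
    post = prior(p_grid) * np.exp(log_like - log_like.max())
    return post / (post.sum() * dp)

flat = lambda p: np.ones_like(p)                          # uniform prior
peaked = lambda p: np.exp(-0.5 * ((p - 0.3) / 0.1) ** 2)  # opinionated prior centred at 0.3

for n_flips in (20, 100):
    n_heads = int(0.6 * n_flips)                          # pretend 60% of flips were heads
    for name, prior in [("flat", flat), ("peaked", peaked)]:
        post = posterior(n_heads, n_flips, prior)
        mean = (p_grid * post).sum() * dp
        print(f"n={n_flips:3d}, {name:6s} prior: posterior mean = {mean:.3f}")
```

The gap between the two posterior means shrinks as the number of flips grows, which is the point of the slide.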


    Physics 509 13

    A prior gotcha

Maybe an obvious point ... if your prior ever equals zero at some value, then your posterior distribution must equal zero at that value as well, no matter what your data says.

Be cautious about choosing priors that are identically zero over any range of interest.


    Physics 509 14

    Objective priors

Much criticism of Bayesian analysis concerns the fact that the result of the analysis depends on the choice of prior, and that the assignment of this prior seems rather subjective.

Is there some objective way of assigning a prior in the case that we know little about its possible distribution?


    Physics 509 15

Shannon's Entropy Theorem

1948: a remarkable paper by Claude Shannon generalizes the concept of entropy to a probability distribution:

S(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i

    Some remarkable properties:

1) The entropy has the same form as the thermodynamic equivalent (modulo Boltzmann's constant).

2) It is a measure of the information content of the distribution. If all the p_i = 0 except for one, then S = 0---we have a perfect constraint. As uncertainty increases, so does S.

3) The entropy is related to data compression---it is the smallest average number of bits needed to encode a message. (Talk to Colin Gay if you want details.)

4) If S is really a measure of the information content of the distribution, and we want to assign a prior that reflects our ignorance of the true value for our parameter, we should assign a prior probability distribution that maximizes S.
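
A quick numerical illustration (my own sketch, not from the slides; the example distributions are invented): compute S for a few distributions over a 6-sided die and note that the uniform one has the largest entropy.

```python
import numpy as np

def entropy(p):
    """Shannon entropy S = -sum_i p_i ln p_i, with 0 ln 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

dists = {
    "certain outcome": [1, 0, 0, 0, 0, 0],
    "two outcomes":    [0.5, 0.5, 0, 0, 0, 0],
    "peaked":          [0.05, 0.1, 0.35, 0.35, 0.1, 0.05],
    "uniform":         [1 / 6] * 6,
}
for name, p in dists.items():
    print(f"{name:16s} S = {entropy(p):.3f}")
# The uniform distribution gives the largest S (= ln 6 ≈ 1.792).
```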


    Physics 509 16

    Maximum Entropy Principle

The distributions at the left are various probability distributions for the outcomes from a 6-sided die, with the entropy superimposed.

Using the one with the largest entropy as your prior results in the weakest constraint (widest uncertainty) on the posterior PDF.


    Physics 509 17

    Maximum Entropy Principle With A Constraint

Often we're not totally ignorant of the prior. For example, perhaps you:

know the mean value of the distribution
know its variance
know the average value of some function of the parameter in question

These all are examples of constraints. The maximum entropy prior will then be the probability distribution P(x) that maximizes

S = -\int dx \, P(x) \ln P(x)

subject to any constraints that may apply.


    Physics 509 18

Finding the probabilities by a variational method

The mathematical statement of the problem is to find a set of probabilities p_1 ... p_n that maximizes the function

S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i

If all of the p_i were independent, this would simply imply:

dS = \frac{\partial S}{\partial p_1} dp_1 + \frac{\partial S}{\partial p_2} dp_2 + \cdots + \frac{\partial S}{\partial p_n} dp_n = 0

Treating the p_i as independent, all of the coefficients must equal zero, and in fact you will wind up concluding that all of the p_i are equal (a uniform prior). This is a mathematical statement of the ignorance principle.


    Physics 509 19

Incorporating constraints with Lagrange multipliers

Suppose now we impose some constraint on the probability distribution, of the general form C(p_1 ... p_n) = 0. Then

dC = \frac{\partial C}{\partial p_1} dp_1 + \frac{\partial C}{\partial p_2} dp_2 + \cdots + \frac{\partial C}{\partial p_n} dp_n = 0

Therefore dS + \lambda \, dC = 0, and so

dS + \lambda \, dC = \left( \frac{\partial S}{\partial p_1} + \lambda \frac{\partial C}{\partial p_1} \right) dp_1 + \cdots + \left( \frac{\partial S}{\partial p_n} + \lambda \frac{\partial C}{\partial p_n} \right) dp_n = 0

We now set the first coefficient equal to zero, giving us an equation for \lambda, and then we are left with a set of simultaneous equations that can be solved for the p_i.


    Physics 509 20

Max Ent prior with only normalization constraint

One constraint always applies: probabilities should sum to 1:

C = \sum_i p_i - 1 = 0

dS + \lambda \, dC = \sum_i \left( \frac{\partial S}{\partial p_i} + \lambda \frac{\partial C}{\partial p_i} \right) dp_i = \sum_i \left( -\ln p_i - 1 + \lambda \right) dp_i = 0

Allowing the p_i to vary independently and setting the coefficients equal to zero gives:

p_i = e^{\lambda - 1}

We plug back into the constraint equation to determine \lambda.
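
Carrying out that last step explicitly (a short completion of my own): the normalization constraint \sum_i p_i = 1 gives n \, e^{\lambda - 1} = 1, so p_i = 1/n for every i---the uniform distribution, exactly as the ignorance principle suggests.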


    Physics 509 21

Max Ent prior when you know the mean

Suppose we have two constraints---normalization and the mean \mu:

\sum_{i=1}^{n} p_i = 1, \qquad \sum_{i=1}^{n} y_i p_i = \mu

d\left[ -\sum_{i=1}^{n} p_i \ln p_i + \lambda_0 \left( \sum_{i=1}^{n} p_i - 1 \right) + \lambda_1 \left( \mu - \sum_{i=1}^{n} y_i p_i \right) \right] = 0

Setting each coefficient to zero:

-\ln p_i - 1 + \lambda_0 - \lambda_1 y_i = 0 \quad \Rightarrow \quad p_i = e^{-1 + \lambda_0} \, e^{-\lambda_1 y_i}

We plug this back into the constraint equations to determine \lambda_0 and \lambda_1. The \lambda_0 factor is a boring normalization term. But the other factor sets the mean value of the distribution.


    Physics 509 22

Max Ent prior: setting the mean

p_i = e^{-1 + \lambda_0} \, e^{-\lambda_1 y_i}

\sum_{i=1}^{n} y_i p_i = \mu = \frac{\sum_i y_i \, e^{-\lambda_1 y_i}}{\sum_i e^{-\lambda_1 y_i}}

Solve numerically for \lambda_1.
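
A minimal numerical sketch of that step (my own illustration; the target mean of 4.5 for a 6-sided die is an assumed value, not from the slides): solve the mean equation for \lambda_1 by root finding and recover the Max Ent probabilities.

```python
import numpy as np
from scipy.optimize import brentq

y = np.arange(1, 7)        # outcomes of a 6-sided die
target_mean = 4.5          # suppose we are told the mean is 4.5 (assumed for illustration)

def mean_given_lambda1(lam1):
    """Mean of the Max Ent distribution p_i proportional to exp(-lam1 * y_i)."""
    w = np.exp(-lam1 * y)
    return np.sum(y * w) / np.sum(w)

# Find lambda_1 such that the distribution has the required mean.
lam1 = brentq(lambda l: mean_given_lambda1(l) - target_mean, -10, 10)
p = np.exp(-lam1 * y)
p /= p.sum()               # normalization plays the role of the lambda_0 factor

print("lambda_1 =", round(lam1, 4))
print("p_i      =", np.round(p, 4), " mean =", round(np.sum(y * p), 4))
```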


    Physics 509 23

    Max Ent prior when you know the variance

What happens if you constrain the variance to equal \sigma^2? (Let's assume here the mean is also known.)

Let's suppose you have prior upper and lower limits on your parameter.

In the limit that the variance is small compared to the range of the parameter,

y_{max}/\sigma \gg 1 \quad \text{and} \quad |y_{min}|/\sigma \gg 1,

then the Max Ent distribution with the specified variance is a Gaussian:

P(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-y^2 / 2\sigma^2}
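
Where the Gaussian comes from (a brief filling-in of my own, using the same Lagrange multiplier recipe as before): maximizing S = -\int dy \, P(y) \ln P(y) subject to normalization, zero mean, and variance \sigma^2 gives

P(y) \propto e^{-\lambda_0 - \lambda_1 y - \lambda_2 y^2},

which is a Gaussian; the constraints fix \lambda_1 = 0 and \lambda_2 = 1/(2\sigma^2) when the allowed range is effectively infinite.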


    Physics 509 24

A Gaussian is the least constraining assumption for the error distribution

A very useful and surprising result follows from this maximum entropy argument. Suppose your data is scattered around your model with an unknown error distribution:

In this example each point is scattered around the model by an error uniformly distributed between -1 and +1.

But suppose I don't know how the errors are distributed. What's the most conservative thing I can assume?

A Gaussian error distribution.


    Physics 509 25

    Consider three possible error models

I don't know how the errors are distributed, but I happen to know the RMS of the data around the model by some means. (Maybe Zeus told me.) I consider three possible models for the error: uniform, Gaussian, and parabolic.


    Physics 509 26

Posterior probability distributions for the three error models

These are marginalized PDFs.

Caveat: although in this case the true error distribution gave the tightest parameter constraints, it's perfectly possible for an incorrect assumption about the error distribution to give inappropriately tight constraints!


    Physics 509 27

    What if you don't know the RMS?

Imagine that the data is so sparse that you don't already know the scatter of the data around the model.

One possibility is to assume a Gaussian distribution for the errors à la the maximum entropy principle, but to leave \sigma^2 as a free parameter. Assign it a physically plausible prior (possibly a Jeffreys prior over a physically plausible range) and just treat it as a nuisance parameter.

    This is more or less like fitting for the size of the error.
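
A minimal sketch of this recipe (my own illustration, with made-up data and a hypothetical straight-line model y = a*x; the grid ranges are arbitrary choices): put a grid over the parameter of interest and over \sigma, use a Gaussian likelihood with a Jeffreys prior on \sigma, and marginalize \sigma away.

```python
import numpy as np

# Made-up data scattered around a straight line y = a*x (a and the scatter are unknown).
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 8)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)

a_grid = np.linspace(0, 4, 401)             # parameter of interest
sigma_grid = np.linspace(0.05, 2.0, 400)    # nuisance parameter, over a plausible range

A, S = np.meshgrid(a_grid, sigma_grid, indexing="ij")
resid2 = ((y[None, None, :] - A[..., None] * x[None, None, :]) ** 2).sum(axis=-1)

# log posterior = log(Gaussian likelihood) + log(Jeffreys prior 1/sigma), up to a constant
log_post = -y.size * np.log(S) - resid2 / (2 * S**2) - np.log(S)
post = np.exp(log_post - log_post.max())

# Marginalize over sigma to get P(a|D,I), then normalize on the grid.
post_a = post.sum(axis=1)
post_a /= post_a.sum() * (a_grid[1] - a_grid[0])
print("posterior mean of a ≈", round(np.sum(a_grid * post_a) * (a_grid[1] - a_grid[0]), 3))
```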


    Physics 509 28

    A very difficult cutting-edge problem ...

Maximum entropy has an important subtlety when dealing with continuous distributions. The continuous case is:

S = -\int dx \, p(x) \ln\!\left( \frac{p(x)}{m(x)} \right)

The weird function m(x) is really the number density of points in parameter space as you go from the discrete case to the continuous limit.

It's really not obvious what m(x) should be. If you know it already, you can use maximum entropy to calculate priors given additional constraints. But if you know absolutely nothing, you can't even define m(x). If you like, m(x) is the prior given no constraints at all.

To get beyond this you must use other principles---for example, use transformation symmetries to generate m(x). A common solution is the general Jeffreys prior---choose a prior that is invariant under a parameter transformation.