Introduction to Bayesian Inference
Tom Loredo, Dept. of Astronomy, Cornell University
http://www.astro.cornell.edu/staff/loredo/bayes/
CASt Summer School, June 11, 2008

Outline
1 The Big Picture
2 Foundations: Logic & Probability Theory
3 Inference With Parametric Models: Parameter Estimation; Model Uncertainty
4 Simple Examples: Binary Outcomes; Normal Distribution; Poisson Distribution
5 Measurement Error Applications
6 Bayesian Computation
7 Probability & Frequency
8 Endnotes: Hotspots, tools, reflections


Scientific Method

"Science is more than a body of knowledge; it is a way of thinking. The method of science, as stodgy and grumpy as it may seem, is far more important than the findings of science." (Carl Sagan)

Scientists argue! An argument is a collection of statements comprising an act of reasoning from premises to a conclusion.

A key goal of science: explain or predict quantitative measurements (data!). Data analysis constructs and appraises arguments that reason from data to interesting scientific conclusions (explanations, predictions).

The Role of Data

Data do not speak for themselves! We don't just tabulate data, we analyze data. We gather data so they may speak for or against existing hypotheses, and guide the formation of new hypotheses.

A key role of data in science is to be among the premises in scientific arguments.

Data Analysis: Building & Appraising Arguments Using Data

Statistical inference is but one of several interacting modes of analyzing data.

Bayesian Statistical Inference

- A different approach to all statistical inference problems (i.e., not just another method in the list: BLUE, maximum likelihood, χ² testing, ANOVA, survival analysis ...)
- Foundation: use probability theory to quantify the strength of arguments (i.e., a more abstract view than restricting probability theory to describing variability in repeated "random" experiments)
- Focuses on deriving consequences of modeling assumptions rather than devising and calibrating procedures

Frequentist vs. Bayesian Statements

"I find conclusion C based on data D_obs ..."

Frequentist assessment: "It was found with a procedure that's right 95% of the time over the set {D_hyp} that includes D_obs." Probabilities are properties of procedures, not of particular results.

Bayesian assessment: "The strength of the chain of reasoning from D_obs to C is 0.95, on a scale where 1 = certainty." Probabilities are properties of specific results. Performance must be separately evaluated (and is typically good by frequentist criteria).


Logic: Some Essentials

"Logic can be defined as the analysis and appraisal of arguments." (Gensler, Intro to Logic)

Build arguments with propositions and logical operators/connectives.

Propositions: statements that may be true or false.
  P: Universe can be modeled with ΛCDM
  B: Λ is not 0
  B̄: "not B," i.e., Λ = 0
  A: Ω_tot ∈ [0.9, 1.1]

Connectives:
  A ∧ B: A and B are both true
  A ∨ B: A or B is true, or both are

Arguments

Argument: assertion that a hypothesized conclusion, H, follows from premises, P = {A, B, C, ...} (take "," = "and").

Notation:
  H|P: Premises P imply H
    H may be deduced from P
    H follows from P
    H is true given that P is true

Arguments are (compound) propositions. The central role of arguments motivates special terminology for true/false: a true argument is valid; a false argument is invalid or fallacious.

Valid vs. Sound Arguments: Content vs. Form

- An argument is factually correct iff all of its premises are true (it has good content).
- An argument is valid iff its conclusion follows from its premises (it has good form).
- An argument is sound iff it is both factually correct and valid (it has good form and content).

We want to make sound arguments. Formal logic and probability theory address validity, but there is no formal approach for addressing factual correctness; there is always a subjective element to an argument.

Factual Correctness

"Although logic can teach us something about validity and invalidity, it can teach us very little about factual correctness. The question of the truth or falsity of individual statements is primarily the subject matter of the sciences." (Hardegree, Symbolic Logic)

"To test the truth or falsehood of premisses is the task of science. ... But as a matter of fact we are interested in, and must often depend upon, the correctness of arguments whose premisses are not known to be true." (Copi, Introduction to Logic)

Premises

- Facts: things known to be true, e.g. observed data
- Reasonable or working assumptions: e.g., Euclid's fifth postulate (parallel lines)
- "Desperate presumption!"
- "Obvious" assumptions: axioms, postulates, e.g., Euclid's first four postulates (line segment between two points; congruency of right angles ...)
- Conclusions from other arguments

Deductive and Inductive Inference

Deduction (syllogism as prototype):
  Premise 1: A implies H
  Premise 2: A is true
  Deduction: H is true
  H|P is valid

Induction (analogy as prototype):
  Premise 1: A, B, C, D, E all share properties x, y, z
  Premise 2: F has properties x, y
  Induction: F has property z
  "F has z"|P is not strictly valid, but may still be rational (likely, plausible, probable); some such arguments are stronger than others

Boolean algebra (and/or/not over {0, 1}) quantifies deduction. Bayesian probability theory (and/or/not over [0, 1]) generalizes this to quantify the strength of inductive arguments.

Real Number Representation of Induction

P(H|P) ≡ strength of argument H|P
  P = 0: argument is invalid
  P = 1: argument is valid
  P ∈ (0, 1): degree of deducibility

A mathematical model for induction:
  AND (product rule): P(A ∧ B|P) = P(A|P) P(B|A ∧ P) = P(B|P) P(A|B ∧ P)
  OR (sum rule): P(A ∨ B|P) = P(A|P) + P(B|P) − P(A ∧ B|P)
  NOT: P(Ā|P) = 1 − P(A|P)

Bayesian inference explores the implications of this model.

Interpreting Bayesian Probabilities

"If we like there is no harm in saying that a probability expresses a degree of reasonable belief. ... 'Degree of confirmation' has been used by Carnap, and possibly avoids some confusion. But whatever verbal expression we use to try to convey the primitive idea, this expression cannot amount to a definition. Essentially the notion can only be described by reference to instances where it is used. It is intended to express a kind of relation between data and consequence that habitually arises in science and in everyday life, and the reader should be able to recognize the relation from examples of the circumstances when it arises." (Sir Harold Jeffreys, Scientific Inference)

More On Interpretation

Physics uses words drawn from ordinary language (mass, weight, momentum, force, temperature, heat, etc.), but their technical meaning is more abstract than their colloquial meaning. We can map between the colloquial and abstract meanings associated with specific values by using specific instances as "calibrators."

A Thermal Analogy

Intuitive notion | Quantification | Calibration
Hot, cold | Temperature, T | Cold as ice = 273 K; boiling hot = 373 K
Uncertainty | Probability, P | Certainty = 0, 1; p = 1/36: plausible as "snake's eyes"; p = 1/1024: plausible as 10 heads

A Bit More On Interpretation

Bayesian: probability quantifies uncertainty in an inductive inference. p(x) describes how probability is distributed over the possible values x might have taken in the single case before us: "p is distributed; x has a single, uncertain value."

Frequentist: probabilities are always (limiting) rates/proportions/frequencies in an ensemble. p(x) describes variability, how the values of x are distributed among the cases in the ensemble: "x is distributed."

Arguments Relating Hypotheses, Data, and Models

We seek to appraise scientific hypotheses in light of observed data and modeling assumptions. Consider the data and modeling assumptions to be the premises of an argument with each of various hypotheses, Hi, as conclusions: Hi|D_obs, I. (I = background information, everything deemed relevant besides the observed data.)

P(Hi|D_obs, I) measures the degree to which (D_obs, I) allow one to deduce Hi. It provides an ordering among arguments for various Hi that share common premises. Probability theory tells us how to analyze and appraise the argument, i.e., how to calculate P(Hi|D_obs, I) from simpler, hopefully more accessible probabilities.

The Bayesian Recipe

Assess hypotheses by calculating their probabilities p(Hi|...) conditional on known and/or presumed information, using the rules of probability theory.

Probability theory axioms:
  OR (sum rule): P(H1 ∨ H2|I) = P(H1|I) + P(H2|I) − P(H1, H2|I)
  AND (product rule): P(H1, D|I) = P(H1|I) P(D|H1, I) = P(D|I) P(H1|D, I)
  NOT: P(H̄1|I) = 1 − P(H1|I)

Three Important Theorems

Bayes's Theorem (BT)

Consider P(Hi, D_obs|I) using the product rule:

  P(Hi, D_obs|I) = P(Hi|I) P(D_obs|Hi, I) = P(D_obs|I) P(Hi|D_obs, I)

Solve for the posterior probability:

  P(Hi|D_obs, I) = P(Hi|I) P(D_obs|Hi, I) / P(D_obs|I)

The theorem holds for any propositions, but for hypotheses and data the factors have names: posterior ∝ prior × likelihood; the normalization constant P(D_obs|I) is the prior predictive.

Law of Total Probability (LTP)

Consider exclusive, exhaustive {Bi} (I asserts one of them must be true):

  Σ_i P(A, Bi|I) = Σ_i P(Bi|A, I) P(A|I) = P(A|I) = Σ_i P(Bi|I) P(A|Bi, I)

If we do not see how to get P(A|I) directly, we can find a set {Bi} and use it as a basis ("extend the conversation"):

  P(A|I) = Σ_i P(Bi|I) P(A|Bi, I)

If our problem already has Bi in it, we can use LTP to get P(A|I) from the joint probabilities ("marginalization"):

  P(A|I) = Σ_i P(A, Bi|I)

Example: take A = D_obs, Bi = Hi; then

  P(D_obs|I) = Σ_i P(D_obs, Hi|I) = Σ_i P(Hi|I) P(D_obs|Hi, I)

The prior predictive for D_obs is the average likelihood for the Hi (a.k.a. the marginal likelihood).

Normalization: for exclusive, exhaustive Hi, Σ_i P(Hi|...) = 1.
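As a concrete illustration, here is a minimal Python sketch (the priors and likelihoods are made-up numbers, and NumPy is assumed) that applies LTP to form the prior predictive and BT to form normalized posteriors over a discrete hypothesis space:

import numpy as np

# Made-up premises: three exclusive, exhaustive hypotheses.
prior = np.array([0.5, 0.3, 0.2])        # P(H_i|I)
like = np.array([0.10, 0.40, 0.05])      # P(D_obs|H_i, I)

# LTP: prior predictive = average likelihood over the hypothesis space.
prior_predictive = np.sum(prior * like)  # P(D_obs|I)

# BT: posterior = prior x likelihood / prior predictive.
posterior = prior * like / prior_predictive
print(posterior, posterior.sum())        # normalization check: sums to 1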

Well-Posed Problems

The rules express desired probabilities in terms of other probabilities. To get a numerical value out, at some point we have to put numerical values in.

Direct probabilities are probabilities with numerical values determined directly by premises (via modeling assumptions, symmetry arguments, previous calculations, desperate presumption ...).

An inference problem is well posed only if all the needed probabilities are assignable based on the premises. We may need to add new assumptions as we see what needs to be assigned. We may not be entirely comfortable with what we need to assume! (Remember Euclid's fifth postulate!) We should explore how results depend on uncomfortable assumptions ("robustness").

Recap

Bayesian inference is more than BT. It quantifies uncertainty by reporting probabilities for things we are uncertain of, given specified premises. It uses all of probability theory, not just (or even primarily) Bayes's theorem.

The rules in plain English:
- Ground rule: specify premises that include everything relevant that you know or are willing to presume to be true (for the sake of the argument!).
- BT: make your appraisal account for all of your premises.
- LTP: if the premises allow multiple arguments for a hypothesis, its appraisal must account for all of them.


Inference With Parametric Models

Models Mi (i = 1 to N), each with parameters θi, each imply a sampling dist'n (a conditional predictive dist'n for possible data):

  p(D|θi, Mi)

The θi dependence when we fix attention on the observed data is the likelihood function:

  Li(θi) ≡ p(D_obs|θi, Mi)

We may be uncertain about i (model uncertainty) or about θi (parameter uncertainty).

Three Classes of Problems

Parameter estimation: premise = choice of model (pick a specific i). What can we say about θi?

Model assessment:
- Model comparison: premise = {Mi}. What can we say about i?
- Model adequacy/GoF: premise = M1 ∨ "all alternatives". Is M1 adequate?

Model averaging: the models share some common parameters, θi = {φ, ηi}. What can we say about φ without committing to one model? (Systematic error is an example.)

Parameter Estimation

Problem statement:
I = model M with parameters θ (+ any additional info)
Hi = statements about θ; e.g. "θ ∈ [2.5, 3.5]," or "θ > 0"
The probability for any such statement can be found using a probability density function (PDF) for θ:

  P(θ ∈ [θ, θ + dθ]|...) = f(θ) dθ = p(θ|...) dθ

Posterior probability density:

  p(θ|D, M) = p(θ|M) L(θ) / ∫ dθ p(θ|M) L(θ)

Summaries of the Posterior

Best-fit values:
- Mode, θ̂: maximizes p(θ|D, M)
- Posterior mean, ⟨θ⟩ = ∫ dθ θ p(θ|D, M)

Uncertainties:
- Credible region Δ of probability C: C = P(θ ∈ Δ|D, M) = ∫_Δ dθ p(θ|D, M). The Highest Posterior Density (HPD) region has p(θ|D, M) higher inside than outside.
- Posterior standard deviation, variance, covariances

Marginal distributions:
- Interesting parameters φ, nuisance parameters η
- Marginal dist'n for φ: p(φ|D, M) = ∫ dη p(φ, η|D, M)

Nuisance Parameters and Marginalization

To model most data, we need to introduce parameters besides those of ultimate interest: nuisance parameters.

Example: we have data from measuring a rate r = s + b that is a sum of an interesting signal s and a background b. We have additional data just about b. What do the data tell us about s?

Marginal posterior distribution:

  p(s|D, M) = ∫ db p(s, b|D, M) ∝ p(s|M) ∫ db p(b|s) L(s, b) ≡ p(s|M) L_m(s)

with L_m(s) the marginal likelihood for s. For a broad prior,

  L_m(s) ≈ p(b̂_s|s) L(s, b̂_s) δb_s

with b̂_s the best b given s, and δb_s the b uncertainty given s. The profile likelihood L_p(s) ≡ L(s, b̂_s) gets weighted by a parameter-space volume factor.

E.g., Gaussians: ŝ = r̂ − b̂, σ_s² = σ_r² + σ_b². Background subtraction is a special case of background marginalization.

Model Comparison

Problem statement:
I = (M1 ∨ M2 ∨ ...): specify a set of models.
Hi = Mi: the hypothesis "chooses" a model.

Posterior probability for a model:

  p(Mi|D, I) = p(Mi|I) p(D|Mi, I)/p(D|I) ∝ p(Mi|I) L(Mi)

where L(Mi) = p(D|Mi) = ∫ dθi p(θi|Mi) p(D|θi, Mi).

Likelihood for a model = average likelihood for its parameters: L(Mi) = ⟨L(θi)⟩.

Varied terminology: prior predictive = average likelihood = global likelihood = marginal likelihood = (weight of) evidence for the model.

Odds and Bayes Factors

A ratio of probabilities for two propositions using the same premises is called the odds favoring one over the other:

  O_ij ≡ p(Mi|D, I)/p(Mj|D, I) = [p(Mi|I)/p(Mj|I)] × [p(D|Mi, I)/p(D|Mj, I)]

The data-dependent part is called the Bayes factor:

  B_ij ≡ p(D|Mi, I)/p(D|Mj, I)

It is a likelihood ratio; the BF terminology is usually reserved for cases when the likelihoods are marginal/average likelihoods.

An Automatic Occam's Razor

Predictive probabilities can favor simpler models:

  p(D|Mi) = ∫ dθi p(θi|M) L(θi)

[Figure: P(D|H) vs. possible data D for a simple H and a complicated H; the simple model concentrates its predictive probability near D_obs.]

The Occam Factor

[Figure: prior and likelihood vs. θ; the likelihood occupies a width δθ within the broader prior range Δθ.]

  p(D|Mi) = ∫ dθi p(θi|M) L(θi) ≈ p(θ̂i|M) L(θ̂i) δθi
          ≈ L(θ̂i) δθi/Δθi  (for a flat prior over a range Δθi)
          = Maximum Likelihood × Occam Factor

- Models with more parameters often make the data more probable for the best fit
- The Occam factor penalizes models for "wasted" volume of parameter space
- Quantifies the intuition that models shouldn't require fine-tuning

Model Averaging

Problem statement:
I = (M1 ∨ M2 ∨ ...): specify a set of models.
The models all share a set of "interesting" parameters, φ.
Each has a different set of nuisance parameters ηi (or different prior info about them).
Hi = statements about φ.

Model averaging: calculate the posterior PDF for φ:

  p(φ|D, I) = Σ_i p(Mi|D, I) p(φ|D, Mi) ∝ Σ_i L(Mi) ∫ dηi p(φ, ηi|D, Mi)

The model choice is a (discrete) nuisance parameter here.

Theme: Parameter Space Volume

Bayesian calculations sum/integrate over parameter/hypothesis space! (Frequentist calculations average over sample space and typically optimize over parameter space.)

- Marginalization weights the profile likelihood by a volume factor for the nuisance parameters.
- Model likelihoods have Occam factors resulting from parameter space volume factors.

Many virtues of Bayesian methods can be attributed to this accounting for the "size" of parameter space. This idea does not arise naturally in frequentist statistics (but it can be added "by hand").

Roles of the Prior

The prior has two roles:
- Incorporate any relevant prior information
- Convert the likelihood from an "intensity" to a "measure"; this accounts for the size of the hypothesis space

Physical analogy:
  Heat:  Q = ∫ dV c_v(r) T(r)
  Probability:  P ∝ ∫ dθ p(θ|I) L(θ)

Maximum likelihood focuses on the hottest hypotheses. Bayes focuses on the hypotheses with the most heat. A high-T region may contain little heat if its c_v is low or if its volume is small. A high-L region may contain little probability if its prior is low or if its volume is small.

Recap of Key Ideas

- Probability as generalized logic for appraising arguments
- Three theorems: BT, LTP, Normalization
- Calculations characterized by parameter space integrals: credible regions, posterior expectations; marginalization over nuisance parameters; Occam's razor via marginal likelihoods


Binary Outcomes: Parameter Estimation

M = existence of two outcomes, S and F; each trial has the same probability for S or F
Hi = statements about α, the probability for success on the next trial; seek p(α|D, M)
D = sequence of results from N observed trials: FFSSSSFSSSFS (n = 8 successes in N = 12 trials)

Likelihood:

  p(D|α, M) = p(success|α, M)^n × p(failure|α, M)^(N−n) = α^n (1−α)^(N−n) ≡ L(α)

Prior

Starting with no information about α beyond its definition, use as an "uninformative" prior p(α|M) = 1. Justifications:
- Intuition: don't prefer any interval to any other of the same size
- Bayes's justification: "ignorance" means that before doing the N trials, we have no preference for how many will be successes:

  P(n successes|M) = 1/(N + 1)  ⇒  p(α|M) = 1

Consider this a convention: an assumption added to M to make the problem well posed.

Prior Predictive

  p(D|M) = ∫ dα α^n (1−α)^(N−n) = B(n+1, N−n+1) = n!(N−n)!/(N+1)!

A Beta integral: B(a, b) ≡ ∫₀¹ dx x^(a−1) (1−x)^(b−1) = Γ(a)Γ(b)/Γ(a+b).

Posterior

  p(α|D, M) = [(N+1)!/(n!(N−n)!)] α^n (1−α)^(N−n)

A Beta distribution. Summaries:
- Best fit: α̂ = n/N = 2/3; ⟨α⟩ = (n+1)/(N+2) ≈ 0.64
- Uncertainty: σ_α = √[(n+1)(N−n+1)/((N+2)²(N+3))] ≈ 0.12
- Find credible regions numerically, or with the incomplete beta function

Note that the posterior depends on the data only through n, not the N binary numbers describing the sequence. n is a (minimal) sufficient statistic.
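These summaries are easy to verify numerically; a minimal sketch assuming SciPy, with the slide's n = 8, N = 12:

import numpy as np
from scipy import stats

n, N = 8, 12                         # successes, trials
post = stats.beta(n + 1, N - n + 1)  # flat prior => Beta(n+1, N-n+1) posterior

print(n / N)                         # mode: 2/3
print(post.mean())                   # (n+1)/(N+2) ~ 0.64
print(post.std())                    # ~ 0.12
print(post.interval(0.683))          # equal-tailed 68.3% credible region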


Binary Outcomes: Model Comparison

Equal probabilities?
M1: α = 1/2
M2: α ∈ [0, 1] with a flat prior

Maximum likelihoods:
  M1: p(D|M1) = 1/2^N = 2.44 × 10⁻⁴
  M2: L(α̂) = (2/3)⁸ (1/3)⁴ = 4.82 × 10⁻⁴

  p(D|M1)/p(D|α̂, M2) = 0.51

Maximum likelihoods favor M2 (failures more probable).

Bayes Factor (ratio of model likelihoods)

  p(D|M1) = 1/2^N  and  p(D|M2) = n!(N−n)!/(N+1)!

  B12 ≡ p(D|M1)/p(D|M2) = (N+1)!/[n!(N−n)! 2^N] = 1.57

The Bayes factor (odds) favors M1 (equiprobable).

Note that for n = 6, B12 = 2.93; for this small amount of data, we can never be very sure the results are equiprobable. If n = 0, B12 ≈ 1/315; if n = 2, B12 ≈ 1/4.8; for extreme data, 12 flips can be enough to lead us to strongly suspect the outcomes have different probabilities. (Frequentist significance tests can reject the null for any sample size.)
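A quick numerical check of these Bayes factors (a sketch using Python's math.comb; the outputs match the numbers quoted above):

from math import comb

def B12(n, N):
    # M1: alpha = 1/2 exactly; M2: flat prior on alpha in [0, 1].
    p_M1 = 0.5 ** N                       # 1/2^N
    p_M2 = 1.0 / ((N + 1) * comb(N, n))   # n!(N-n)!/(N+1)!
    return p_M1 / p_M2

for n in (8, 6, 0, 2):
    print(n, B12(n, 12))                  # 1.57, 2.93, ~1/315, ~1/4.8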

Binary Outcomes: Binomial Distribution

Suppose D = n (the number of successes in N trials), rather than the actual sequence. What is p(α|n, M)?

Likelihood: let S = a sequence of flips with n successes.

  p(n|α, M) = Σ_S p(S|α, M) p(n|S, α, M) = α^n (1−α)^(N−n) C_{n,N}

where C_{n,N} = the number of sequences of length N with n successes (and p(n|S, α, M) = 1 when S has n successes). Therefore

  p(n|α, M) = [N!/(n!(N−n)!)] α^n (1−α)^(N−n)

This is the binomial distribution for n given α, N.

Posterior

  p(α|n, M) = [N!/(n!(N−n)!)] α^n (1−α)^(N−n) / p(n|M)

  p(n|M) = [N!/(n!(N−n)!)] ∫ dα α^n (1−α)^(N−n) = 1/(N + 1)

  ⇒ p(α|n, M) = [(N+1)!/(n!(N−n)!)] α^n (1−α)^(N−n)

The same result as when the data specified the actual sequence.

Another Variation: Negative Binomial

Suppose D = N, the number of trials it took to obtain a predefined number of successes, n = 8. What is p(α|N, M)?

Likelihood: p(N|α, M) is the probability for n − 1 successes in N − 1 trials, times the probability that the final trial is a success:

  p(N|α, M) = [(N−1)!/((n−1)!(N−n)!)] α^(n−1) (1−α)^(N−n) × α
            = [(N−1)!/((n−1)!(N−n)!)] α^n (1−α)^(N−n)

This is the negative binomial distribution for N given α, n.

Posterior

  p(α|D, M) = C′_{n,N} α^n (1−α)^(N−n) / p(D|M)

  p(D|M) = C′_{n,N} ∫ dα α^n (1−α)^(N−n)

  ⇒ p(α|D, M) = [(N+1)!/(n!(N−n)!)] α^n (1−α)^(N−n)

The same result as in the other cases.

Final Variation: Meteorological Stopping

Suppose D = (N, n), the number of samples and the number of successes in an observing run whose total number was determined by the weather at the telescope. What is p(α|D, M′)? (M′ adds info about the weather to M.)

Likelihood: p(D|α, M′) is the binomial distribution times the probability that the weather allowed N samples, W(N):

  p(D|α, M′) = W(N) [N!/(n!(N−n)!)] α^n (1−α)^(N−n)

Let C″_{n,N} = W(N) N!/(n!(N−n)!). We get the same result as before!

Likelihood Principle

To define L(Hi) = p(D_obs|Hi, I), we must contemplate what other data we might have obtained. But the "real" sample space may be determined by many complicated, seemingly irrelevant factors; it may not be well specified at all. Should this concern us?

Likelihood principle: the result of inferences depends only on how p(D_obs|Hi, I) varies w.r.t. hypotheses. We can ignore aspects of the observing/sampling procedure that do not affect this dependence.

This is a sensible property that frequentist methods do not share. Frequentist probabilities are "long run" rates of performance, and depend on details of the sample space that are irrelevant in a Bayesian calculation.

Example: predict 10% of a sample is Type A; observe nA = 5 for N = 96.
  A significance test accepts α = 0.1 for binomial sampling: P(χ² > χ²_obs|α = 0.1) = 0.12.
  A significance test rejects α = 0.1 for negative binomial sampling: P(χ² > χ²_obs|α = 0.1) = 0.03.

Inference With Normals/Gaussians

Gaussian PDF:

  p(x|μ, σ) = (1/(σ√(2π))) e^(−(x−μ)²/2σ²)  over [−∞, ∞]

Common abbreviated notation: x ~ N(μ, σ²)

Parameters:
  μ = ⟨x⟩ ≡ ∫ dx x p(x|μ, σ)
  σ² = ⟨(x − μ)²⟩ ≡ ∫ dx (x − μ)² p(x|μ, σ)

Gauss's Observation: Sufficiency

Suppose our data consist of N measurements, di = μ + εi. Suppose the noise contributions are independent, with εi ~ N(0, σ²). Then

  p(D|μ, σ, M) = Π_i p(di|μ, σ, M) = Π_i p(εi = di − μ|μ, σ, M)
              = Π_i (1/(σ√(2π))) exp[−(di − μ)²/2σ²]
              = (1/(σ^N (2π)^(N/2))) e^(−Q(μ)/2σ²)

Find the dependence of Q on μ by completing the square:

  Q = Σ_i (di − μ)² = Σ_i di² + Nμ² − 2Nμd̄ = N(μ − d̄)² + Nr²

where d̄ ≡ (1/N) Σ_i di and r² ≡ (1/N) Σ_i (di − d̄)².

The likelihood depends on {di} only through d̄ and r:

  L(μ, σ) = (1/(σ^N (2π)^(N/2))) exp(−Nr²/2σ²) exp(−N(μ − d̄)²/2σ²)

The sample mean and variance are sufficient statistics. This is a miraculous compression of information: the normal dist'n is highly abnormal in this respect!

Estimating a Normal Mean

Problem specification:
Model: di = μ + εi, εi ~ N(0, σ²), with σ known; I = (σ, M).
Parameter space: μ; seek p(μ|D, σ, M).

Likelihood:

  p(D|μ, σ, M) = (1/(σ^N (2π)^(N/2))) exp(−Nr²/2σ²) exp(−N(μ − d̄)²/2σ²)
              ∝ exp(−N(μ − d̄)²/2σ²)

Uninformative prior:
Translation invariance ⇒ p(μ) ∝ C, a constant. This prior is improper unless bounded.

Prior predictive/normalization:

  p(D|σ, M) = ∫ dμ C exp(−N(μ − d̄)²/2σ²) = C (σ/√N) √(2π)

... minus a tiny bit from the tails, using a proper prior.

Posterior

  p(μ|D, σ, M) = (1/((σ/√N)√(2π))) exp(−N(μ − d̄)²/2σ²)

The posterior is N(d̄, w²), with standard deviation w = σ/√N. The 68.3% HPD credible region for μ is d̄ ± σ/√N. Note that C drops out; the limit of an infinite prior range is well behaved.
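A small simulation sketch of this posterior (NumPy assumed; the true μ, σ, and N are made-up values):

import numpy as np

rng = np.random.default_rng(42)
mu_true, sigma, N = 5.0, 1.0, 20        # assumed for the simulation
d = rng.normal(mu_true, sigma, N)

dbar = d.mean()
w = sigma / np.sqrt(N)                  # posterior is N(dbar, w^2)
print(f"posterior: N({dbar:.3f}, {w:.3f}^2)")
print(f"68.3% HPD region: [{dbar - w:.3f}, {dbar + w:.3f}]")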

Informative Conjugate Prior

Use a normal prior, μ ~ N(μ0, w0²).

Posterior: normal, N(μ̃, w̃²), but the mean and std. deviation "shrink" towards the prior. Define B = w²/(w² + w0²), so B < 1 and B = 0 when w0 is large. Then

  μ̃ = (1 − B) d̄ + B μ0
  w̃ = w √(1 − B)

Principle of stable estimation: the prior affects estimates only when the data are not informative relative to the prior.

Estimating a Normal Mean: Unknown σ

Problem specification:
Model: di = μ + εi, εi ~ N(0, σ²), with σ unknown.
Parameter space: (μ, σ); seek p(μ|D, M).

Likelihood:

  p(D|μ, σ, M) = (1/(σ^N (2π)^(N/2))) exp(−Nr²/2σ²) exp(−N(μ − d̄)²/2σ²)
              ∝ (1/σ^N) e^(−Q(μ)/2σ²)

where Q(μ) = N[r² + (μ − d̄)²].

Uninformative Priors

Assume the priors for μ and σ are independent.
Translation invariance ⇒ p(μ) ∝ C, a constant.
Scale invariance ⇒ p(σ) ∝ 1/σ (flat in log σ).

Joint posterior for μ, σ:

  p(μ, σ|D, M) ∝ (1/σ^(N+1)) e^(−Q(μ)/2σ²)

Marginal Posterior

  p(μ|D, M) ∝ ∫₀^∞ dσ (1/σ^(N+1)) e^(−Q/2σ²)

Substitute τ = Q/2σ², so σ = (Q/2τ)^(1/2) and |dσ| = (1/2)(Q/2)^(1/2) τ^(−3/2) dτ:

  p(μ|D, M) ∝ Q^(−N/2) ∫ dτ τ^((N−2)/2) e^(−τ) ∝ Q^(−N/2)

Write Q = Nr²[1 + ((μ − d̄)/r)²] and normalize:

  p(μ|D, M) = [Γ(N/2) / (Γ((N−1)/2) √π r)] [1 + ((μ − d̄)/r)²]^(−N/2)

This is a Student's t distribution, with t = (μ − d̄)/(r/√N): a bell curve, but with power-law tails.

Large N:  p(μ|D, M) ≈ e^(−N(μ−d̄)²/2r²)
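In SciPy's parameterization, this marginal posterior corresponds to a t distribution with N − 1 degrees of freedom, location d̄, and scale r/√(N − 1); a sketch with simulated data (all values assumed):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d = rng.normal(5.0, 2.0, 30)             # simulated data; sigma treated as unknown
N, dbar = len(d), d.mean()
r = np.sqrt(np.mean((d - dbar) ** 2))    # r^2 = (1/N) sum (d_i - dbar)^2

# p(mu|D,M) ~ [1 + ((mu-dbar)/r)^2]^(-N/2): Student's t with N-1 dof,
# loc = dbar, scale = r/sqrt(N-1) in scipy's convention.
post = stats.t(df=N - 1, loc=dbar, scale=r / np.sqrt(N - 1))
print(post.mean(), post.interval(0.683))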

Poisson Dist'n: Infer a Rate from Counts

Problem: observe n counts in time T; infer the rate, r.

Likelihood:

  L(r) ≡ p(n|r, M) = (rT)^n e^(−rT)/n!

Prior: two simple standard choices (or the conjugate gamma dist'n):
- r known to be nonzero; it is a scale parameter: p(r|M) = [1/ln(ru/rl)] (1/r)
- r may vanish; require p(n|M) ≈ const: p(r|M) = 1/ru

Prior predictive:

  p(n|M) = (1/ru)(1/n!) ∫₀^(ru) dr (rT)^n e^(−rT)
         = (1/(ru T))(1/n!) ∫₀^(ru T) d(rT) (rT)^n e^(−rT)
         ≈ 1/(ru T)  for  ru ≫ n/T

Posterior: a gamma distribution:

  p(r|n, M) = T (rT)^n e^(−rT)/n!

Gamma Distributions

A two-parameter family of distributions over nonnegative x, with shape parameter α and scale parameter s:

  p_Γ(x|α, s) = (1/(s Γ(α))) (x/s)^(α−1) e^(−x/s)

Moments: E(x) = αs; Var(x) = αs².

Our posterior corresponds to α = n + 1, s = 1/T.
- Mode r̂ = n/T; mean ⟨r⟩ = (n+1)/T (shift down by 1 with the 1/r prior)
- Std. dev'n σ_r = √(n+1)/T; credible regions found by integrating (can use the incomplete gamma function)
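A sketch of these posterior summaries with SciPy's gamma distribution (the n and T values are hypothetical):

from scipy import stats

n, T = 16, 1.0                              # hypothetical: 16 counts in time T
post = stats.gamma(a=n + 1, scale=1.0 / T)  # flat-prior posterior for r

print(post.mean())                          # <r> = (n+1)/T
print(post.std())                           # sigma_r = sqrt(n+1)/T
print(post.interval(0.683))                 # 68.3% region via incomplete gamma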


The Flat Prior

Bayes's justification: not that "ignorance of r" implies p(r|I) = C; rather, require the (discrete) predictive distribution to be flat:

  p(n|I) = ∫ dr p(r|I) p(n|r, I) = C  ⇒  p(r|I) = C

Useful conventions:
- Use a flat prior for a rate that may be zero
- Use a log-flat prior (∝ 1/r) for a nonzero scale parameter
- Use proper (normalized, bounded) priors
- Plot the posterior with an abscissa that makes the prior flat

The On/Off Problem

Basic problem:
- Look off-source; unknown background rate b. Count Noff photons in interval Toff.
- Look on-source; the rate is r = s + b with unknown signal s. Count Non photons in interval Ton.
- Infer s.

Conventional solution:

  b̂ = Noff/Toff;  σ_b = √Noff/Toff
  r̂ = Non/Ton;  σ_r = √Non/Ton
  ŝ = r̂ − b̂;  σ_s = √(σ_r² + σ_b²)

But ŝ can be negative!

Examples

[Figure: spectra of X-ray sources (Bassani et al. 1989; Di Salvo et al. 2001).]

[Figure: spectrum of ultrahigh-energy cosmic rays, flux × E³ vs. log10(E) in eV, comparing HiRes-1/2 monocular and AGASA data (Nagano & Watson 2000; HiRes Team 2007).]

N is Never Large

"Sample sizes are never large. If N is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once N is 'large enough,' you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc., etc.). N is never enough because if it were 'enough' you'd already be on to the next problem for which you need more data.

Similarly, you never have quite enough money. But that's another story." (Andrew Gelman, blog entry, 31 July 2005)

Backgrounds as Nuisance Parameters

Background marginalization with Gaussian noise: measure the background rate, b̂ ± σ_b, with the source off; measure the total rate, r̂ ± σ_r, with the source on. Infer the signal source strength s, where r = s + b. With flat priors,

  p(s, b|D, M) ∝ exp[−(b − b̂)²/2σ_b²] exp[−(s + b − r̂)²/2σ_r²]

Marginalize over b to summarize the results for s (complete the square to isolate the b dependence; then do a simple Gaussian integral over b):

  p(s|D, M) ∝ exp[−(s − ŝ)²/2σ_s²],  with  ŝ = r̂ − b̂  and  σ_s² = σ_r² + σ_b²

Background subtraction is a special case of background marginalization.
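The Gaussian integral can be checked numerically; a sketch with hypothetical measured rates, marginalizing the joint posterior over b on a grid:

import numpy as np

bhat, sig_b = 3.0, 0.5      # hypothetical off-source measurement
rhat, sig_r = 5.0, 0.7      # hypothetical on-source measurement

s = np.linspace(-1.0, 6.0, 400)
b = np.linspace(0.0, 8.0, 2000)
S, B = np.meshgrid(s, b, indexing="ij")

# Joint posterior (flat priors), then numerical marginalization over b.
joint = np.exp(-((B - bhat) ** 2) / (2 * sig_b ** 2)
               - ((S + B - rhat) ** 2) / (2 * sig_r ** 2))
marg = joint.sum(axis=1)
marg /= marg.sum() * (s[1] - s[0])        # normalize the marginal PDF

# Analytic result: Gaussian with s_hat = rhat - bhat, var = sig_r^2 + sig_b^2.
print(s[np.argmax(marg)], rhat - bhat)    # both ~ 2.0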

Bayesian Solution to the On/Off Problem

First consider the off-source data; use it to estimate b:

  p(b|Noff, Ioff) = Toff (bToff)^Noff e^(−bToff) / Noff!

Use this as a prior for b to analyze the on-source data. For the on-source analysis, Iall = (Ion, Noff, Ioff):

  p(s, b|Non, Iall) ∝ p(s) p(b|Iall) [(s + b)Ton]^Non e^(−(s+b)Ton)

p(s|Iall) is flat, but p(b|Iall) = p(b|Noff, Ioff), so

  p(s, b|Non, Iall) ∝ (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton+Toff))

Now marginalize over b:

  p(s|Non, Iall) = ∫ db p(s, b|Non, Iall) ∝ ∫ db (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton+Toff))

Expand (s + b)^Non and do the resulting Γ integrals:

  p(s|Non, Iall) = Σ_{i=0}^{Non} Ci [Ton (sTon)^i e^(−sTon)/i!],
  with  Ci ∝ (1 + Toff/Ton)^i (Non + Noff − i)!/(Non − i)!

The posterior is a weighted sum of gamma distributions, each assigning a different number of on-source counts to the source. (Evaluate via a recursive algorithm or the confluent hypergeometric function.)
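A sketch of this posterior in Python (SciPy assumed; the counts and intervals are hypothetical), computing the normalized weights Ci in log space for stability:

import numpy as np
from scipy.special import gammaln

def on_off_posterior(s, N_on, T_on, N_off, T_off):
    # Weights C_i propto (1 + T_off/T_on)^i (N_on+N_off-i)!/(N_on-i)!
    i = np.arange(N_on + 1)
    logC = (i * np.log1p(T_off / T_on)
            + gammaln(N_on + N_off - i + 1.0) - gammaln(N_on - i + 1.0))
    C = np.exp(logC - logC.max())
    C /= C.sum()                          # normalize the weights
    # Gamma terms: T_on (s T_on)^i e^(-s T_on) / i!
    s = np.atleast_1d(s)[:, None]
    log_g = np.log(T_on) + i * np.log(s * T_on) - s * T_on - gammaln(i + 1.0)
    return (C * np.exp(log_g)).sum(axis=1)

s_grid = np.linspace(1e-3, 12.0, 600)
pdf = on_off_posterior(s_grid, N_on=8, T_on=1.0, N_off=3, T_off=2.0)
print(s_grid[np.argmax(pdf)])             # posterior mode for s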

Example On/Off Posteriors

[Figure: example on/off posteriors, short integrations, Ton = 1.]

[Figure: example on/off posteriors, long background integrations, Ton = 1.]


Empirical Number Counts Distributions

Star counts, galaxy counts, GRBs, TNOs ...

[Figures: 13 TNO surveys (c. 2001); BATSE 4B catalog (~1200 GRBs); with flux scalings F ∝ L/d² (+ cosmology, extinction) and, for TNOs, F ∝ D²/(d² d′²) for diameter D and the two distance factors.]

Selection Effects and Measurement Error

Selection effects (truncation, censoring): obvious (usually)
- Typically treated by "correcting" the data
- Most sophisticated: product-limit estimators

"Scatter" effects (measurement error, etc.): insidious
- Typically ignored (average out?)

Many Guises of Measurement Error

[Figures: Auger data above the GZK cutoff (Nov 2007); QSO hardness vs. luminosity (Kelly 2007).]

History

Eddington, Jeffreys (1920s to 1940)
[Figure: true vs. measured magnitude distributions, n(m) vs. m and n(m̂) vs. m̂, with measurement uncertainty.]

Malmquist, Lutz-Kelker:
- Joint accounting for truncation and (intrinsic) scatter in 2-D data (flux + distance indicator, parallax)
- Assume a homogeneous spatial distribution

Many rediscoveries of scatter biases:
- Radio sources (1970s)
- Galaxies (Eddington, Malmquist; 1990s)
- Linear regression (1990s)
- GRBs (1990s)
- X-ray sources (1990s; 2000s)
- TNOs/KBOs (c. 2000)
- Galaxy redshift dist'ns (2007+)

Accounting For Measurement Error

Introduce latent/hidden/incidental parameters. Suppose f(x|θ) is a distribution for an observable, x. From N precisely measured samples, {xi}, we can infer θ from

  L(θ) ≡ p({xi}|θ) = Π_i f(xi|θ)

Graphical representation:

[Graph: θ → x1, x2, ..., xN]

  L(θ) ≡ p({xi}|θ) = Π_i f(xi|θ)

But what if the x data are noisy, Di = {xi + εi}?

We should somehow incorporate ℓi(xi) ≡ p(Di|xi):

  L(θ, {xi}) ≡ p({Di}|θ, {xi}) = Π_i ℓi(xi) f(xi|θ)

Marginalize (sum probabilities) over {xi} to summarize for θ. Marginalize over θ to summarize results for {xi}. Key point: maximizing over xi and integrating over xi can give very different results!

Graphical representation:

[Graph: θ → x1, x2, ..., xN → D1, D2, ..., DN]

  L(θ, {xi}) ≡ p({Di}|θ, {xi}) = Π_i p(Di|xi) f(xi|θ) = Π_i ℓi(xi) f(xi|θ)

A two-level multi-level model (MLM).
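A sketch of the two-level idea, assuming a Gaussian population f(x|θ) = N(x; θ, τ²) with τ known and Gaussian measurement errors, so each latent xi can be marginalized analytically (all numbers are made up):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
theta_true, tau, N = 2.0, 1.0, 50
x = rng.normal(theta_true, tau, N)        # latent true values
sig = rng.uniform(0.5, 2.0, N)            # per-object noise levels
D = rng.normal(x, sig)                    # observed noisy values

def log_marginal_like(theta):
    # p(D_i|theta) = int dx_i N(D_i; x_i, sig_i^2) N(x_i; theta, tau^2)
    #              = N(D_i; theta, sig_i^2 + tau^2)
    return stats.norm.logpdf(D, theta, np.sqrt(sig ** 2 + tau ** 2)).sum()

grid = np.linspace(0.0, 4.0, 200)
ll = np.array([log_marginal_like(t) for t in grid])
print(grid[np.argmax(ll)])                # close to theta_true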

Example: Distribution of Source Fluxes

Measure m = −2.5 log(flux) from sources following a "rolling power law" distribution (inspired by trans-Neptunian objects):

  f(m) ∝ 10^[α(m−23) + α′(m−23)²]

[Figure: f(m) vs. m.]

- Simulate 100 surveys of populations drawn from the same dist'n.
- Simulate data for a photon-counting instrument with a fixed count threshold.
- Measurements have uncertainties from 1% (bright) to ≈30% (dim).
- Analyze the simulated data with maximum (profile) likelihood and with Bayes.

Parameter estimates from Bayes (circles) and maximum likelihood (crosses):

[Figure: estimates from the 100 simulated surveys.]

Uncertainties don't average out!

Smaller measurement error only postpones the inevitable. A similar toy survey, with parameters chosen to mimic SDSS QSO surveys (few-percent errors at the dim end):

[Figure.]

Bayesian MLMs in Astronomy

- Directional & spatio-temporal coincidences: GRB repetition (Luo+ 1996; Graziani+ 1996); GRB host ID (Band 1998; Graziani+ 1999); VO cross-matching (Budavári & Szalay 2008)
- Magnitude surveys/number counts/"log N - log S": GRB peak flux dist'n (Loredo & Wasserman 1998); TNO/KBO magnitude distribution (Gladman+ 1998; Petit+ 2008)
- Dynamic spectroscopy: SN 1987A neutrinos, uncertain energy vs. time (Loredo & Lamb 2002)
- Linear regression: QSO hardness vs. luminosity (Kelly 2007)


Bayesian Computation

Large sample size: Laplace approximation
- Approximate the posterior as multivariate normal; det(covariance) factors
- Uses ingredients available in χ²/ML fitting software (MLE, Hessian)
- Often accurate to O(1/N)

Low-dimensional models (d ≲ 5) and beyond: posterior sampling
- Create a RNG that samples the posterior
- MCMC is the most general framework
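As an illustration of the posterior-sampling route, here is a minimal random-walk Metropolis sketch (NumPy assumed), applied to the normal-mean posterior from earlier:

import numpy as np

def metropolis(log_post, x0, step, n_samples, seed=0):
    # Minimal random-walk Metropolis sampler (illustration only).
    rng = np.random.default_rng(seed)
    x, lp = x0, log_post(x0)
    chain = np.empty(n_samples)
    for i in range(n_samples):
        xp = x + step * rng.normal()
        lpp = log_post(xp)
        if np.log(rng.uniform()) < lpp - lp:   # accept w.p. min(1, ratio)
            x, lp = xp, lpp
        chain[i] = x
    return chain

dbar, sigma, N = 5.0, 1.0, 20                  # assumed summary values
log_post = lambda mu: -0.5 * N * (mu - dbar) ** 2 / sigma ** 2
chain = metropolis(log_post, x0=0.0, step=0.5, n_samples=20000)
print(chain[5000:].mean(), chain[5000:].std())  # ~ dbar, sigma/sqrt(N)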


Probability & Frequency

Frequencies are relevant when modeling repeated trials, or repeated sampling from a population or ensemble. Frequencies are observables:
- When available, they can be used to infer probabilities for the next trial
- When unavailable, they can be predicted

Bayesian/frequentist relationships:
- General relationships between probability and frequency
- Long-run performance of Bayesian procedures
- Examples of Bayesian/frequentist differences

Relationships Between Probability & Frequency

Frequency from probability: Bernoulli's law of large numbers: in repeated i.i.d. trials, given P(success|...) = α, predict

  N_success/N_total → α  as  N_total → ∞

Probability from frequency: Bayes's "An Essay Towards Solving a Problem in the Doctrine of Chances" (the first use of Bayes's theorem): the probability for success in the next trial of an i.i.d. sequence satisfies

  E(α) → N_success/N_total  as  N_total → ∞

Subtle Relationships For Non-IID Cases

Predict frequency in dependent trials:
rt = result of trial t; p(r1, r2, ..., rN|M) is known; predict f:

  ⟨f⟩ = (1/N) Σ_t p(rt = success|M), where p(r1|M) = Σ_{r2} ··· Σ_{rN} p(r1, r2, ...|M)

Expected frequency of an outcome in many trials = average probability for the outcome across trials. But we also find that σ_f need not converge to 0.

Infer probabilities for different but related trials:
Shrinkage: biased estimators of the probability that share info across trials are better than unbiased/BLUE/MLE estimators.

A formalism that distinguishes p from f from the outset is particularly valuable for exploring subtle connections. E.g., shrinkage is explored via hierarchical and empirical Bayes.

Frequentist Performance of Bayesian Procedures

Many results are known for parametric Bayes performance:
- Estimates are consistent if the prior doesn't exclude the true value.
- Credible regions found with flat priors are typically confidence regions to O(n^(−1/2)); "reference" priors can improve their performance to O(n^(−1)).
- Marginal distributions have better frequentist performance than conventional methods like profile likelihood. (Bartlett correction, ancillaries, bootstrap are competitive but hard.)
- Bayesian model comparison is asymptotically consistent (not true of significance/NP tests, AIC).
- For separate (not nested) models, the posterior probability for the true model converges to 1 exponentially quickly.
- Wald's complete class theorem: optimal frequentist methods are Bayes rules (equivalent to Bayes for some prior) ...

Parametric Bayesian methods are typically good frequentist methods. (Not so clear in nonparametric problems.)


Some Bayesian Astrostatistics Hotspots

- Cosmology: parametric modeling of CMB, LSS, SNe Ia → cosmological params; nonparametric modeling of SN Ia multicolor light curves; nonparametric "emulation" of cosmological models
- Extrasolar planets: parametric modeling of Keplerian reflex motion (planet detection, orbit estimation); optimal scheduling via Bayesian experimental design
- Photon counting data (X-rays, γ-rays, cosmic rays): upper limits, hardness ratios; parametric spectroscopy (line detection, etc.)
- Gravitational wave astronomy: parametric modeling of binary inspirals; high-multiplicity parametric modeling of the white dwarf background

Tools for Computational Bayes

Astronomer/Physicist Tools:

- BIE (http://www.astro.umass.edu/~weinberg/proto_bie/): Bayesian Inference Engine: general framework for Bayesian inference, tailored to astronomical and earth-science survey data. Built-in database capability to support analysis of terabyte-scale data sets. Inference is by Bayes via MCMC.
- XSpec, CIAO/Sherpa: both environments have some basic Bayesian capability (including basic MCMC in XSpec).
- CosmoMC (http://cosmologist.info/cosmomc/): parameter estimation for cosmological models using CMB and other data via MCMC.
- ExoFit (http://zuserver2.star.ucl.ac.uk/~lahav/exofit.html): adaptive MCMC for fitting exoplanet RV data.
- CDF Bayesian Limit Software (http://www-cdf.fnal.gov/physics/statistics/statistics_software.html): limits for Poisson counting processes, with background & efficiency uncertainties.
- root (http://root.cern.ch/): Bayesian support? (BayesDivide)
- Inference (forthcoming at http://inference.astro.cornell.edu/): several self-contained Bayesian modules (Gaussian, Poisson, directional processes); Parametric Inference Engine (PIE) supports χ², likelihood, and Bayes.

Python:

- PyMC (http://trichech.us/pymc): a framework for MCMC via Metropolis-Hastings; also implements Kalman filters and Gaussian processes. Targets biometrics, but is general.
- SimPy (http://simpy.sourceforge.net/; intro at http://heather.cs.ucdavis.edu/~matloff/simpy.html): SimPy (rhymes with "Blimpie") is a process-oriented public-domain package for discrete-event simulation.
- RSPython (http://www.omegahat.org/): bi-directional communication between Python and R.
- MDP (http://mdp-toolkit.sourceforge.net/): Modular toolkit for Data Processing: current emphasis is on machine learning (PCA, ICA ...). Modularity allows combination of algorithms and other data processing elements into "flows."
- Orange (http://www.ailab.si/orange/): component-based data mining, with preprocessing, modeling, and exploration components. Python/GUI interfaces to C++ implementations. Some Bayesian components.
- ELEFANT (http://rubis.rsise.anu.edu.au/elefant): machine learning library and platform providing Python interfaces to efficient, lower-level implementations. Some Bayesian components (Gaussian processes; Bayesian ICA/PCA).

R and S:

- CRAN Bayesian task view (http://cran.r-project.org/src/contrib/Views/Bayesian.html): overview of many R packages implementing various Bayesian models and methods.
- Omega-hat (http://www.omegahat.org/): RPython, RMatlab, R-Xlisp.
- BOA (http://www.public-health.uiowa.edu/boa/): Bayesian Output Analysis: convergence diagnostics and statistical and graphical analysis of MCMC output; can read BUGS output files.
- CODA (http://www.mrc-bsu.cam.ac.uk/bugs/documentation/coda03/cdaman03.html): Convergence Diagnosis and Output Analysis: menu-driven R/S plugins for analyzing BUGS output.

Java:

- Omega-hat (http://www.omegahat.org/): Java environment for statistical computing, being developed by XLisp-stat and R developers.
- Hydra (http://research.warnes.net/projects/mcmc/hydra/): HYDRA provides methods for implementing MCMC samplers using Metropolis, Metropolis-Hastings, and Gibbs methods. In addition, it provides classes implementing several unique adaptive and multiple chain/parallel MCMC methods.
- YADAS (http://www.stat.lanl.gov/yadas/home.html): software system for statistical analysis using MCMC, based on the multi-parameter Metropolis-Hastings algorithm (rather than parameter-at-a-time Gibbs sampling).

C/C++/Fortran:

- BayeSys 3 (http://www.inference.phy.cam.ac.uk/bayesys/): sophisticated suite of MCMC samplers including transdimensional capability, by the author of MemSys.
- fbm (http://www.cs.utoronto.ca/~radford/fbm.software.html): Flexible Bayesian Modeling: MCMC for simple Bayes, Bayesian regression and classification models based on neural networks and Gaussian processes, and Bayesian density estimation and clustering using mixture models and Dirichlet diffusion trees.
- BayesPack, DCUHRE (http://www.sci.wsu.edu/math/faculty/genz/homepage): adaptive quadrature, randomized quadrature, Monte Carlo integration.
- BIE, CDF Bayesian limits (see above).

Other Statisticians' & Engineers' Tools:

- BUGS/WinBUGS (http://www.mrc-bsu.cam.ac.uk/bugs/): Bayesian Inference Using Gibbs Sampling: flexible software for the Bayesian analysis of complex statistical models using MCMC.
- OpenBUGS (http://mathstat.helsinki.fi/openbugs/): BUGS on Windows and Linux, and from inside R.
- XLisp-stat (http://www.stat.uiowa.edu/~luke/xls/xlsinfo/xlsinfo.html): Lisp-based data analysis environment, with an emphasis on providing a framework for exploring the use of dynamic graphical methods.
- ReBEL (http://choosh.csee.ogi.edu/rebel/): library supporting recursive Bayesian estimation in Matlab (Kalman filter, particle filters, sequential Monte Carlo).

Closing Reflections

Philip Dawid (2000): "What is the principal distinction between Bayesian and classical statistics? It is that Bayesian statistics is fundamentally boring. There is so little to do: just specify the model and the prior, and turn the Bayesian handle. There is no room for clever tricks or an alphabetic cornucopia of definitions and optimality criteria. I have heard people use this 'dullness' as an argument against Bayesianism. One might as well complain that Newton's dynamics, being based on three simple laws of motion and one of gravitation, is a poor substitute for the richness of Ptolemy's epicyclic system. All my experience teaches me that it is invariably more fruitful, and leads to deeper insights and better data analyses, to explore the consequences of being a 'thoroughly boring Bayesian.'"

Dennis Lindley (2000): "The philosophy places more emphasis on model construction than on formal inference. ... I do agree with Dawid that 'Bayesian statistics is fundamentally boring.' ... My only qualification would be that the theory may be boring but the applications are exciting."