Bayesian Reasoning: Maximum Entropy
Lecture 8
A/Prof Geraint F. Lewis, Rm 560: gfl@physics.usyd.edu.au
http://www.physics.usyd.edu.au/~gfl/Lecture

Common Sense

We have spent quite a bit of time exploring the posterior probability distribution but, of course, to calculate this we need to use the likelihood function and our prior knowledge.

However, how our prior knowledge is encoded is the biggest source of argument about Bayesian statistics, with cries that subjective choices influence the outcome (but shouldn’t this be the case?).

Realistically, we could consider a wealth of prior probability distributions that agree with the available constraints (e.g. a specified mean), but which do we choose?

Answer: we pick the one which is maximally non-committal about missing information.

Shannon’s Theorem

In 1948, Shannon developed a measure of the uncertainty of a probability distribution, which he labeled entropy. He showed that the uncertainty of a discrete probability distribution {pi} is

$S = -\sum_i p_i \ln p_i$

Jaynes argued that the maximally non-committal probability distribution is the one with the maximum entropy; hence, of all possible probability distributions we should choose the one that maximizes S.

The other distributions will imply some sort of correlation (we’ll see this in a moment).
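
As a concrete illustration (not part of the original lecture), a few lines of Python compute this entropy for any discrete distribution; the function name is my own:

```python
import numpy as np

def shannon_entropy(p):
    """Entropy S = -sum_i p_i ln(p_i) of a discrete distribution (taking 0 ln 0 = 0)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                     # ensure normalisation
    nz = p > 0                          # ignore zero-probability outcomes
    return -np.sum(p[nz] * np.log(p[nz]))

print(shannon_entropy([0.5, 0.5]))      # ln 2 ~ 0.693, the two-outcome maximum
print(shannon_entropy([0.9, 0.1]))      # smaller: this distribution is more committal
```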

Example

You are told that an experiment has two possible outcomes; what is the maximally non-committal distribution you should assign to the two outcomes?

Clearly, if we assign p1 = x, then p2 = (1 - x) and the entropy is

$S(x) = -x \ln x - (1-x)\ln(1-x)$

The maximum value of the entropy occurs at p1=p2=1/2.

But isn’t this what you would have guessed?

If we have any further information (e.g. the existence of correlations between outcomes 1 and 2) we can build this into the measure above and re-maximize.
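
A quick numerical check of this claim (illustrative only; the code is my own):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Two-outcome entropy S(x) = -x ln x - (1-x) ln(1-x)
def S(x):
    return -(x * np.log(x) + (1 - x) * np.log(1 - x))

# Maximise S by minimising -S over the open interval (0, 1)
res = minimize_scalar(lambda x: -S(x), bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x)      # ~0.5, i.e. p1 = p2 = 1/2
print(S(res.x))   # ~ln 2
```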

The Kangaroo Justification

Suppose you are given some basic information about the population of Australian kangaroos:

1) 1/3 of kangaroos have blue eyes

2) 1/3 of kangaroos are left handed

How many kangaroos are blue-eyed and left-handed? We know that:

                     Left-handed: True    Left-handed: False
Blue eyes: True             p1                    p2
Blue eyes: False            p3                    p4

The Kangaroo Justification

What are the options? (The numbers for each case are written out after this list.)

1) Independent case (no correlation)

2) Maximal positive correlation

3) Maximal negative correlation
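
The entries for these three cases follow directly from the stated constraints; a brief sketch of the algebra (my own filling-in of the tables shown on the slide):

```latex
% Constraints: p1 + p2 = 1/3 (blue-eyed), p1 + p3 = 1/3 (left-handed), p1 + p2 + p3 + p4 = 1.
% Writing p1 = z gives the one-parameter family
%   p2 = p3 = 1/3 - z,  p4 = 1/3 + z,  with 0 <= z <= 1/3.
\begin{align*}
  \text{Independent:}            \quad & p_1 = \tfrac{1}{3}\cdot\tfrac{1}{3} = \tfrac{1}{9}, \quad p_2 = p_3 = \tfrac{2}{9}, \quad p_4 = \tfrac{4}{9} \\
  \text{Maximal positive corr.:} \quad & p_1 = \tfrac{1}{3}, \quad p_2 = p_3 = 0, \quad p_4 = \tfrac{2}{3} \\
  \text{Maximal negative corr.:} \quad & p_1 = 0, \quad p_2 = p_3 = \tfrac{1}{3}, \quad p_4 = \tfrac{1}{3}
\end{align*}
```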

The Kangaroo Justification

So there is a range of potential p1 values (each of which sets all the other values), but which do we choose? Again, we wish to be non-committal and not assume any prior correlations (unless we have evidence to support a particular prior).

What constraint can we put on {pi} to select this particular case? Consider maximizing various candidate variational functions over the allowed range of values:

Variational function      Optimal z (= p1)     Implied correlation
-Σ pi ln(pi)                   1/9                uncorrelated
-Σ pi^2                        1/12               negative
Σ ln(pi)                       0.1301             positive
Σ pi^(1/2)                     0.1218             positive

So the variational function that selects the non-committal (uncorrelated) case is the entropy. As we will see, this is very important for image reconstruction.
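
A numerical check of this table (my own code, not the lecture's), maximising each variational function over the single free parameter z = p1, with p2 = p3 = 1/3 - z and p4 = 1/3 + z:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def probs(z):
    """One-parameter family of distributions allowed by the kangaroo constraints."""
    return np.array([z, 1/3 - z, 1/3 - z, 1/3 + z])

# Candidate variational functions from the table
functions = {
    "-sum p ln p": lambda p: -np.sum(p * np.log(p)),
    "-sum p^2":    lambda p: -np.sum(p**2),
    "sum ln p":    lambda p: np.sum(np.log(p)),
    "sum sqrt p":  lambda p: np.sum(np.sqrt(p)),
}

for name, f in functions.items():
    # Maximise f(probs(z)) by minimising its negative, keeping z strictly inside (0, 1/3)
    res = minimize_scalar(lambda z: -f(probs(z)), bounds=(1e-6, 1/3 - 1e-6), method="bounded")
    print(f"{name:12s} optimal z = {res.x:.4f}")

# Expected output (up to rounding): 0.1111 (= 1/9), 0.0833 (= 1/12), 0.1301, 0.1218
```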

Incorporating a prior

Section 8.4 of the textbook discusses a justification of the MaxEnt approach, considering the rolling of a weighted die and examining the “multiplicity” of the outcomes (i.e. some potential outcomes are more likely than others).

Suppose you have some prior information that you want to incorporate into the entropy measure, so that we have prior estimates {mi} of our probabilities {pi}. Following the same arguments, the quantity we want to maximize is the Shannon-Jaynes entropy

$S = -\sum_i p_i \ln\!\left(\frac{p_i}{m_i}\right)$

If the {mi} are all equal, this has no influence on the maximization; we will see that this is important when considering image reconstruction.
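
A minimal sketch (mine, not the lecture's) of the Shannon-Jaynes entropy, confirming that a uniform prior {mi} only shifts S by a constant and so does not move the maximum:

```python
import numpy as np

def shannon_jaynes(p, m):
    """S = -sum_i p_i ln(p_i / m_i) for prior estimates m_i."""
    p, m = np.asarray(p, dtype=float), np.asarray(m, dtype=float)
    return -np.sum(p * np.log(p / m))

p = np.array([0.2, 0.3, 0.5])
m_uniform = np.full(3, 1/3)

# With a uniform prior, S differs from the ordinary entropy only by the constant ln(3)
print(shannon_jaynes(p, m_uniform))
print(-np.sum(p * np.log(p)) - np.log(3))
```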

Incorporating a prior

When considering a continuous probability distribution, the entropy becomes

$S = -\int p(y)\, \ln\!\left(\frac{p(y)}{m(y)}\right) dy$

where m(y) is known as the Lebesgue measure.

This quantity (which still encodes our prior) ensures that the entropy is insensitive to a change of coordinates, since m(y) and p(y) transform in the same way.

Some examples

Suppose you are told that some experiment has n possible outcomes. Without further information, what prior distribution would you assign to the outcomes? Your prior estimates (without additional information) would be {mi} = 1/n; what does MaxEnt say the values of {pi} should be?

The quantity we maximize is our entropy with a Lagrange multiplier to account for the constraint that the probabilities sum to one:

$S_c = -\sum_i p_i \ln\!\left(\frac{p_i}{m_i}\right) + \lambda\left(1 - \sum_i p_i\right)$

Some examples

Taking the (partial) derivative of Sc with respect to the pi and the multiplier λ, we can show that

$\frac{\partial S_c}{\partial p_i} = -\ln\!\left(\frac{p_i}{m_i}\right) - 1 - \lambda = 0$

and so

$p_i = m_i\, e^{-(1+\lambda)}$

All that is left is to evaluate λ, which we get from the normalization constraint:

$\sum_i p_i = e^{-(1+\lambda)} \sum_i m_i = 1$

Since the {mi} also sum to one, λ = -1 and {pi} = {mi}.
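
A numerical spot-check of this result (illustrative only): maximising the Shannon-Jaynes entropy subject only to normalisation returns the prior itself.

```python
import numpy as np
from scipy.optimize import minimize

n = 4
m = np.full(n, 1/n)                      # prior estimates {m_i} = 1/n

def neg_entropy(p):
    return np.sum(p * np.log(p / m))     # -S, which we minimise

constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1}]
p0 = np.random.dirichlet(np.ones(n))     # arbitrary starting point

res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(1e-9, 1)] * n, constraints=constraints)
print(res.x)                             # ~[0.25 0.25 0.25 0.25] = {m_i}
```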

Nicer examples

What if you have additional constraints, such as knowing the mean of the outcome? Then your constrained entropy is

$S_c = -\sum_i p_i \ln\!\left(\frac{p_i}{m_i}\right) + \lambda_0\left(1 - \sum_i p_i\right) + \lambda_1\left(\mu - \sum_i x_i\, p_i\right)$

where we now have two Lagrange multipliers, one for each of the constraints (normalization and the known mean μ of the outcome values xi).

Through the same procedure, we can look for the maximum, and find

$p_i = m_i\, \exp\!\left[-(1 + \lambda_0 + \lambda_1 x_i)\right]$

Generally, solving for the multipliers analytically is difficult, but it is straightforward numerically.
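
As an illustration of the numerical route (my own sketch, not code from the lecture): maximise the Shannon-Jaynes entropy for a six-sided die subject to normalisation and a specified mean, here ⟨x⟩ = 4.5 as in a later example.

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)                  # die outcomes x_i = 1..6
m = np.full(6, 1/6)                      # uniform prior {m_i}
target_mean = 4.5                        # the mean constraint (try 3.5 to recover the fair die)

def neg_entropy(p):
    return np.sum(p * np.log(p / m))     # minimise -S

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1},                    # normalisation
    {"type": "eq", "fun": lambda p: np.sum(faces * p) - target_mean},  # mean constraint
]

res = minimize(neg_entropy, m, method="SLSQP",
               bounds=[(1e-9, 1)] * 6, constraints=constraints)
print(np.round(res.x, 4))                # probabilities rise monotonically towards face 6
print(np.sum(faces * res.x))             # ~4.5
```

Setting target_mean = 3.5 returns the uniform distribution, in line with the next example.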

Nicer examples

Suppose that you are told that a die has a mean score of μ dots per roll; what is the probability weighting of each face? If these weights are equal, the die is unweighted and fair. If, however, the probabilities differ, we should suppose that the die is unfair.

If μ = 3.5, it is easy to show from the constraints that λ0 = -1 and λ1 = 0 (write out the two constraints in terms of the previous equation and divide out λ0). If we have no prior reason to think otherwise, each face is weighted equally in the prior, and so the final result is that {pi} = {mi}.

The result is as we expect: for an (unweighted) average of 3.5, the most probable distribution is that all faces have equal weight.

Nicer examples

Suppose, however, you were told that the mean was μ = 4.5; what is the most probable distribution {pi}? We can follow the same procedure as in the previous example, but now find that λ0 = -0.37 and λ1 = 0.49, which give the distribution {pi} shown on the slide.

As we would expect, the distribution is now skewed towards the higher die faces (increasing the mean of a sequence of rolls).

Additional constraints

Additional information will provide additional constraints on the probability distribution. If we know a mean and a variance, then

$\sum_i p_i = 1, \qquad \sum_i x_i\, p_i = \mu, \qquad \sum_i (x_i - \mu)^2\, p_i = \sigma^2$

Given what we have seen previously, we should expect the solution (taking the continuum limit) to be of the form

$p(x) \propto m(x)\, \exp\!\left[-\lambda_1 x - \lambda_2 (x - \mu)^2\right]$

which, for a uniform measure m(x) and when appropriately normalized, is the (expected) Gaussian distribution.
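
A brief sketch of that final step (my own filling-in, for the uniform-measure case):

```latex
% For uniform m(x), impose the mean and variance constraints on
%   p(x) \propto \exp[-\lambda_1 x - \lambda_2 (x - \mu)^2].
% Completing the square shows the mean constraint forces \lambda_1 = 0, and the
% variance constraint fixes \lambda_2 = 1/(2\sigma^2), so after normalisation
\begin{equation*}
  p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}
         \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right].
\end{equation*}
```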

Image Reconstruction

In science, we are interested in gleaning underlying physical properties from data sets, although in general the data contain signals which are blurred (through optics or other physical effects), with added noise (such as photon arrival statistics or detector noise). So, how do we extract our image from the blurry, noisy data?

Image Reconstruction

Naively, you might assume that you can simply “invert” the process and recover the original image. However, the problem is ill-posed, and a direct “deconvolution” will amplify the noise in a (usually) catastrophic way. We could attempt to suppress the noise (e.g. with Wiener filtering), but isn’t there another way?

Image Reconstruction

Our image consists of a series of pixels, each with a photon count Ii. We can treat this as a probability distribution, such that

$p_i = \frac{I_i}{\sum_j I_j}$

The value in each pixel, therefore, is the probability that the next photon will arrive in that pixel.

Note that for an image pi ≥ 0, and so we are dealing with a “positive, additive distribution” (this is important, as some techniques are happy to add negative flux in regions to improve a reconstruction).
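
In code, this normalisation is a one-liner (the function name below is my own):

```python
import numpy as np

def image_to_probabilities(counts):
    """Normalise pixel counts I_i into p_i = I_i / sum_j I_j (requires I_i >= 0)."""
    I = np.asarray(counts, dtype=float)
    if np.any(I < 0):
        raise ValueError("pixel counts must be non-negative for a positive, additive distribution")
    return I / I.sum()

p = image_to_probabilities([[4, 1], [0, 5]])
print(p, p.sum())   # the p_i sum to 1
```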

Image Reconstruction

We can apply Bayes’ theorem to calculate the posterior probability of a proposed “true” image {Ii}, given the data. Following the argument given in the text, we see that the posterior takes the form

$p(\{I_i\} \mid D) \propto \exp\!\left(\alpha S - \tfrac{1}{2}\chi^2\right)$

Image Reconstruction

So we aim to maximize

$\alpha S - \tfrac{1}{2}\chi^2$

The method, therefore, requires us to have a way of generating proposal images (i.e. throwing down blobs of light), convolving them with our blurring function to give a predicted image, and comparing this to the data through χ².

The requirement that pi ≥ 0 ensures that the proposal image is everywhere positive (which is good!).

What does the entropy term do? It provides a “regularization” which drives the solution towards our prior distribution (mi), while the χ² term drives the fit to the data. Note, however, that we sometimes need to add additional regularization terms to enforce smoothness on the solution.
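
A toy 1-D sketch of the whole procedure (entirely my own construction, not code from the lecture; the PSF, noise level and regularisation weight α are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# --- Fake data: a sparse 1-D "image", blurred by a Gaussian PSF, with added noise ---
npix = 40
true = np.zeros(npix)
true[[10, 11, 25]] = [1.0, 0.6, 0.8]
true /= true.sum()                                # treat the image as {p_i}

x = np.arange(-6, 7)
psf = np.exp(-0.5 * (x / 2.0) ** 2)
psf /= psf.sum()

def blur(p):
    return np.convolve(p, psf, mode="same")       # predicted (blurred) image

sigma = 0.003
data = blur(true) + rng.normal(0, sigma, npix)    # noisy observed data

# --- MaxEnt objective: maximise alpha*S - chi^2/2, i.e. minimise its negative ---
m = np.full(npix, 1 / npix)                       # flat prior image {m_i}
alpha = 0.05                                      # regularisation weight, chosen by hand

def neg_objective(p):
    S = -np.sum(p * np.log(p / m))                # Shannon-Jaynes entropy
    chi2 = np.sum((data - blur(p)) ** 2) / sigma**2
    return -(alpha * S - 0.5 * chi2)

constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1}]
res = minimize(neg_objective, m, method="SLSQP",
               bounds=[(1e-9, 1)] * npix, constraints=constraints,
               options={"maxiter": 500})

recon = res.x
print("brightest reconstructed pixels:", np.sort(np.argsort(recon)[-3:]))  # near 10, 11, 25
```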

Image Reconstruction

Here is an example of MaxEnt reconstruction with differing point-spread functions (psf) and added noise.

Exactly what you get back depends on the quality of your data, but in each case you can read the recovered message.

Image Reconstruction

Reconstruction of the radio galaxy M87 (Bryan & Skilling 1980) using MaxEnt. Note the reduction in the noise and higher detail visible in the radio jet.

Image Reconstruction

Not always a good thing!! (MaxEnt)