Maximum Entropy Distributions


    Contents

4 Maximum entropy distributions
    4.1 Bayes theorem and consistent reasoning
    4.2 Maximum entropy distributions
    4.3 Entropy
    4.4 Discrete maximum entropy examples¹
        4.4.1 Discrete uniform
        4.4.2 Discrete nonuniform
    4.5 Normalization and partition functions
        4.5.1 Special case: Logistic shape, Bernoulli distribution
    4.6 Combinatoric maximum entropy examples
        4.6.1 Binomial
        4.6.2 Multinomial
        4.6.3 Hypergeometric
    4.7 Continuous maximum entropy examples
        4.7.1 Continuous uniform
        4.7.2 Known mean
        4.7.3 Gaussian (normal) distribution and known variance
        4.7.4 Lognormal distribution
    4.8 Maximum entropy posterior distributions
        4.8.1 Sequential processing for three states
        4.8.2 Another frame: sequential maximum relative entropy
        4.8.3 Simultaneous processing of moment and data conditions
    4.9 Convergence or divergence in beliefs
        4.9.1 Diverse beliefs
        4.9.2 Complete ignorance
        4.9.3 Convergence in beliefs
    4.10 Appendix: summary of maximum entropy probability assignments

¹ A table summarizing some maximum entropy probability assignments is found at the end of the chapter (see Park and Bera [2009] for additional maxent assignments).


4 Maximum entropy distributions

Bayesian analysis illustrates scientific reasoning where consistent reasoning (in the sense that two individuals with the same background knowledge, evidence, and perceived veracity of the evidence reach the same conclusion) is fundamental. In the next few pages we survey foundational ideas: Bayes theorem (and its product and sum rules), maximum entropy probability assignment, and consistent updating with maximum entropy probability assignments (posterior probability assignment).

    4.1 Bayes theorem and consistent reasoning

Consistency is the hallmark of scientific reasoning. When we consider probability assignment to events, whether the events are marginal, conditional, or joint, the assignments should be mutually consistent (match up with common sense). This is what Bayes product and sum rules express formally.

The product rule says the product of a conditional likelihood (or distribution) and the marginal likelihood (or distribution) of the conditioning variable equals the joint likelihood (or distribution).

p(x, y) = p(x | y) p(y) = p(y | x) p(x)

The sum rule says if we sum over all events related to one (set of) variable(s) (integrate out one variable or set of variables), we are left with the likelihood (or distribution) of the remaining (set of) variable(s).

p(x) = \sum_{i=1}^{n} p(x, y_i)

p(y) = \sum_{j=1}^{n} p(x_j, y)

Bayes theorem combines these ideas to describe consistent evaluation of evidence. That is, the posterior likelihood associated with a proposition, θ, given the evidence, y, is equal to the product of the conditional likelihood of the evidence given the proposition and the marginal or prior likelihood of the conditioning variable (the proposition), scaled by the likelihood of the evidence. Notice, we've simply rewritten the product rule where both sides are divided by p(y), and p(y) is simply the sum rule where θ is integrated out of the joint distribution, p(θ, y).

p(θ | y) = \frac{p(y | θ) p(θ)}{p(y)}

For Bayesian analyses, we often find it convenient to suppress the normalizing factor, p(y), and write the posterior distribution as proportional to the product of the sampling distribution or likelihood function and the prior distribution.

p(θ | y) ∝ p(y | θ) p(θ)

or for a particular draw y = y0

p(θ | y = y0) ∝ ℓ(θ | y = y0) p(θ)

where p(y | θ) is the sampling distribution, ℓ(θ | y = y0) is the likelihood function evaluated at y = y0, and p(θ) is the prior distribution for θ. Bayes theorem is the glue that holds consistent probability assignment together.

    Example 1 Consider the following joint distribution:

p(y = y1, θ = θ1)   p(y = y2, θ = θ1)   p(y = y1, θ = θ2)   p(y = y2, θ = θ2)
0.1                 0.4                 0.2                 0.3

    The sum rule yields the following marginal distributions:

p(y = y1)   p(y = y2)
0.3         0.7

and

p(θ = θ1)   p(θ = θ2)
0.5         0.5


    The product rule gives the conditional distributions:

      p(y | θ = θ1)   p(y | θ = θ2)
y1    0.2             0.4
y2    0.8             0.6

and

      p(θ | y = y1)   p(θ | y = y2)
θ1    1/3             4/7
θ2    2/3             3/7

    as common sense dictates.
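The bookkeeping in Example 1 is easy to mechanize. The following is a minimal sketch (assuming NumPy is available, not the text's own code) that recovers the marginal and conditional distributions from the joint table via the sum and product rules.

```python
# Joint distribution from Example 1, indexed as joint[y, theta].
import numpy as np

joint = np.array([[0.1, 0.2],    # p(y=y1, theta=theta1), p(y=y1, theta=theta2)
                  [0.4, 0.3]])   # p(y=y2, theta=theta1), p(y=y2, theta=theta2)

p_y = joint.sum(axis=1)                # sum rule: [0.3, 0.7]
p_theta = joint.sum(axis=0)            # sum rule: [0.5, 0.5]
p_y_given_theta = joint / p_theta      # product rule rearranged: p(y | theta)
p_theta_given_y = joint / p_y[:, None] # Bayes theorem: p(theta | y)
print(p_theta_given_y)                 # rows [1/3, 2/3] and [4/7, 3/7]
```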

4.2 Maximum entropy distributions

From the above, we see that evaluation of propositions given evidence is entirely determined by the sampling distribution, p(y | θ), or likelihood function, ℓ(θ | y), and the prior distribution for the proposition, p(θ). Consequently, assignment of these probabilities is a matter of some considerable import. How do we proceed? Jaynes suggests we take account of our background knowledge, ℐ, and evaluate the evidence in a manner consistent with both background knowledge and evidence. That is, the posterior likelihood (or distribution) is more aptly represented by

    p (!| y,") ! p (y| !,")p (! | ")

Now, we're looking for a mathematical statement of what we know and only what we know. For this idea to be properly grounded requires a sense of complete ignorance (even though this may never represent our state of background knowledge). For instance, if we think that 1 is more likely the mean or expected value than 2, then we must not be completely ignorant about the location of the random variable and consistency demands that our probability assignment reflect this knowledge. Further, if the order of events or outcomes is not exchangeable (if one permutation is more plausible than another), then the events are not seen as stochastically independent or identically distributed.¹ The mathematical statement of our background knowledge is defined in terms of Shannon's entropy (or sense of diffusion or uncertainty).

¹ Exchangeability is foundational for independent and identically distributed events (iid), a cornerstone of inference. However, exchangeability is often invoked in a conditional sense. That is, conditional on a set of variables, exchangeability applies; this is a foundational idea of conditional expectations or regression.


    4.3 Entropy

Shannon defines entropy as²

h = -\sum_{i=1}^{n} p_i \log p_i

for discrete events where

\sum_{i=1}^{n} p_i = 1

or differential entropy as³

h = -\int p(x) \log p(x)\, dx

for events with continuous support where

\int p(x)\, dx = 1

Shannon derived a measure of entropy so that five conditions are satisfied: (1) a measure h exists, (2) the measure is smooth, (3) the measure is monotonically increasing in uncertainty, (4) the measure is consistent in the sense that if different measures exist they lead to the same conclusions, and (5) the measure is additive. Additivity says joint entropy equals the entropy of the signals (y) plus the probability weighted average of entropy conditional on the signals.

H(x, y) = H(y) + Pr(y_1) H(x | y_1) + \cdots + Pr(y_n) H(x | y_n)

This latter term, Pr(y_1) H(x | y_1) + \cdots + Pr(y_n) H(x | y_n), is called conditional entropy. Internal logical consistency of entropy is maintained primarily via additivity, the analog to the workings of Bayes theorem for probabilities.

² The axiomatic development for this measure of entropy can be found in Jaynes [2003] or Accounting and Causal Effects: Econometric Challenges, ch. 13.

³ Jaynes [2003] argues that Shannon's differential entropy logically includes an invariance measure m(x) such that differential entropy is defined as

h = -\int p(x) \log \frac{p(x)}{m(x)}\, dx


4.4 Discrete maximum entropy examples⁴

    4.4.1 Discrete uniform

Example 2 Suppose we know only that there are three possible (exchangeable) events, x1 = 1, x2 = 2, and x3 = 3. The maximum entropy probability assignment is found by solving the Lagrangian

L ≡ \max_{p_i} \left[ -\sum_{i=1}^{3} p_i \log p_i - (λ_0 - 1)\left(\sum_{i=1}^{3} p_i - 1\right) \right]

First order conditions yield

p_i = e^{-λ_0} for i = 1, 2, 3

and

λ_0 = \log 3

Hence, as expected, the maximum entropy probability assignment is a discrete uniform distribution with p_i = 1/3 for i = 1, 2, 3.

    4.4.2 Discrete nonuniform

Example 3 Now suppose we know a little more. We know the mean is 2.5.⁵ The Lagrangian is now

L ≡ \max_{p_i} \left[ -\sum_{i=1}^{3} p_i \log p_i - (λ_0 - 1)\left(\sum_{i=1}^{3} p_i - 1\right) - λ_1 \left(\sum_{i=1}^{3} p_i x_i - 2.5\right) \right]

First order conditions yield

p_i = e^{-λ_0 - λ_1 x_i} for i = 1, 2, 3

and

λ_0 = 2.987
λ_1 = -0.834

The maximum entropy probability assignment is

p_1 = 0.116
p_2 = 0.268
p_3 = 0.616

⁴ A table summarizing some maximum entropy probability assignments is found at the end of the chapter (see Park and Bera [2009] for additional maxent assignments).

⁵ Clearly, if we knew the mean is 2 then we would assign the uniform discrete distribution above.
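For Example 3, the same probability assignment can be obtained by maximizing entropy numerically subject to the two constraints. The following is an illustrative sketch (assuming SciPy is available), not the text's own code.

```python
# Numerical check of Example 3: support {1, 2, 3}, known mean 2.5.
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0])

def neg_entropy(p):
    return np.sum(p * np.log(p))          # minimize -h = sum p log p

constraints = (
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},  # probabilities sum to one
    {"type": "eq", "fun": lambda p: p @ x - 2.5},    # known mean
)
result = minimize(neg_entropy, x0=np.full(3, 1 / 3),
                  bounds=[(1e-9, 1.0)] * 3, constraints=constraints)
print(result.x.round(3))   # approximately [0.116, 0.268, 0.616]
```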


    4.5 Normalization and partition functions

The above analysis suggests a general approach for assigning probabilities where normalization is absorbed into a denominator. Since

\exp[-λ_0] \sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_k)\right] = 1,

p(x_i) = \frac{\exp[-λ_0] \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_i)\right]}{1}

       = \frac{\exp[-λ_0] \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_i)\right]}{\exp[-λ_0] \sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_k)\right]}

       = \frac{\exp\left[-\sum_{j=1}^{m} λ_j f_j(x_i)\right]}{\sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_k)\right]}

       = \frac{k(x_i)}{Z(λ_1, \ldots, λ_m)}

where f_j(x_i) is a function of the random variable, x_i, reflecting what we know,⁷

k(x_i) = \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_i)\right]

is a kernel, and

Z(λ_1, \ldots, λ_m) = \sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_k)\right]

is a normalizing factor, called a partition function.⁸ Probability assignment is completed by determining the Lagrange multipliers, λ_j, j = 1, \ldots, m, from the m constraints, which are functions of the random variables.

⁷ Since λ_0 simply ensures the probabilities sum to unity and the partition function assures this, we can define the partition function without λ_0. That is, λ_0 cancels as demonstrated above.

⁸ In physical statistical mechanics, the partition function describes the partitioning among different microstates and serves as a generator function for all manner of results regarding a process. The notation, Z, refers to the German word for sum over states, Zustandssumme. An example with relevance for our purposes is

-\frac{\partial \log Z}{\partial λ_1} = E[x],

where E[x] is the mean of the distribution. For the example below, we find

-\frac{\partial \log Z}{\partial λ_1} = \frac{3 + 2e^{λ_1} + e^{2λ_1}}{1 + e^{λ_1} + e^{2λ_1}} = 2.5

Solving gives λ_1 = -0.834, consistent with the approach below.


Return to the example above. Since we know the support and the mean, n = 3 and the function is f(x_i) = x_i. This implies

Z(λ_1) = \sum_{i=1}^{3} \exp[-λ_1 x_i]

and

p_i = \frac{k(x_i)}{Z(λ_1)} = \frac{\exp[-λ_1 x_i]}{\sum_{k=1}^{3} \exp[-λ_1 x_k]}

    where x1 = 1, x2 = 2, and x3 = 3. Now, solving the constraint

\sum_{i=1}^{3} p_i x_i - 2.5 = 0

\sum_{i=1}^{3} \frac{\exp[-λ_1 x_i]}{\sum_{k=1}^{3} \exp[-λ_1 x_k]}\, x_i - 2.5 = 0

produces the multiplier, λ_1 = -0.834, and identifies the probability assignments

p_1 = 0.116
p_2 = 0.268
p_3 = 0.616
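The same numbers fall out of the partition function route: treat λ1 as the unknown, impose the mean condition, and recover the probabilities from the kernel. A sketch assuming SciPy's brentq root finder, for illustration only:

```python
# Partition function route for the same example (support {1, 2, 3}, mean 2.5).
import numpy as np
from scipy.optimize import brentq

x = np.array([1.0, 2.0, 3.0])

def mean_gap(lam):
    k = np.exp(-lam * x)          # kernel k(x_i)
    p = k / k.sum()               # p_i = k(x_i) / Z(lambda_1)
    return p @ x - 2.5            # moment condition

lam1 = brentq(mean_gap, -5.0, 5.0)
k = np.exp(-lam1 * x)
print(round(lam1, 3), (k / k.sum()).round(3))   # -0.834 and [0.116, 0.268, 0.616]
```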

We utilize the analog to the above partition function approach next to address continuous density assignment as well as a special case involving binary (Bernoulli) probability assignment, which takes the shape of a logistic distribution.

    4.5.1 Special case: Logistic shape, Bernoulli distribution

Example 4 Suppose we know a binary (0, 1) variable has mean equal to 3/5. The maximum entropy probability assignment by the partition function approach is

p(x = 1) = \frac{\exp[-λ(1)]}{\exp[-λ(0)] + \exp[-λ(1)]} = \frac{\exp[-λ]}{1 + \exp[-λ]} = \frac{1}{1 + \exp[λ]}

This is the shape of the density for a logistic distribution (and would be a logistic density if support were unbounded rather than binary).⁹ Solving for λ = \log(2/3), or p(x = 1) = 3/5, reveals the assigned probability of success or characteristic parameter for a Bernoulli distribution.

⁹ This suggests logistic regression is a natural (as well as the most common) strategy for modeling discrete choice.
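A brief numerical restatement of this special case (assuming NumPy): the multiplier λ = log(2/3) implied by the mean constraint reproduces the success probability 3/5.

```python
# Bernoulli maximum entropy assignment with mean constraint E[x] = 3/5.
import numpy as np

lam = np.log(2 / 3)               # multiplier implied by the mean constraint
p_success = 1 / (1 + np.exp(lam)) # logistic form from the partition function
print(p_success)                  # 0.6
```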

    4.6 Combinatoric maximum entropy examples

Combinatorics, the number of exchangeable ways events occur, plays a key role in discrete (countable) probability assignment. The binomial distribution is a fundamental building block.

    4.6.1 Binomial

Suppose we know there are binary (Bernoulli or "success"/"failure") outcomes associated with each of n draws and the expected value of success equals np. The expected value of failure is redundant, hence there is only one moment condition. Then, the maximum entropy assignment includes the number of combinations which produce x1 "successes" and x0 "failures". Of course, this is the binomial operator,

\binom{n}{x_1, x_0} = \frac{n!}{x_1!\, x_0!}

where x_1 + x_0 = n. Let m_i ≡ \binom{n}{x_{1i}, x_{0i}} and generalize the entropy measure to account for m,

s = -\sum_{i=1}^{n} p_i \log\left(\frac{p_i}{m_i}\right)

Now, the kernel (drawn from the Lagrangian) is

k_i = m_i \exp[-λ_1 x_i]

Satisfying the moment condition yields the maximum entropy or binomial probability assignment.

p(x, n) = \frac{k_i}{Z} = \begin{cases} \binom{n}{x_1, x_0} p^{x_1} (1 - p)^{x_0}, & x_1 + x_0 = n \\ 0, & \text{otherwise} \end{cases}



Example 5 (balanced coin) Suppose we assign x1 = 1 for each coin flip resulting in a head, x0 = 1 for a tail, and the expected value is E[x1] = E[x0] = 6 in 12 coin flips. We know there are 2^{12} = \sum_{x_1 + x_0 = 12} \binom{12}{x_1, x_0} = 4,096 combinations of heads and tails. Solving λ_1 = 0, the maximum entropy probability assignment associated with x equal to x1 heads and x0 tails in 12 coin flips is

p(x, n = 12) = \begin{cases} \binom{12}{x_1} \left(\frac{1}{2}\right)^{x_1} \left(\frac{1}{2}\right)^{x_0}, & x_1 + x_0 = 12 \\ 0, & \text{otherwise} \end{cases}

Example 6 (unbalanced coin) Continue the coin flip example above except heads are twice as likely as tails. In other words, the expected values are E[x1] = 8 and E[x0] = 4 in 12 coin flips. Solving λ_1 = -\log 2, the maximum entropy probability assignment associated with x1 heads in 12 coin flips is

p(x, n = 12) = \begin{cases} \binom{12}{x_1} \left(\frac{2}{3}\right)^{x_1} \left(\frac{1}{3}\right)^{x_0}, & x_1 + x_0 = 12 \\ 0, & \text{otherwise} \end{cases}

p(x, n = 12)

[x1, x0]    balanced coin    unbalanced coin
[0, 12]     1/4,096          1/531,441
[1, 11]     12/4,096         24/531,441
[2, 10]     66/4,096         264/531,441
[3, 9]      220/4,096        1,760/531,441
[4, 8]      495/4,096        7,920/531,441
[5, 7]      792/4,096        25,344/531,441
[6, 6]      924/4,096        59,136/531,441
[7, 5]      792/4,096        101,376/531,441
[8, 4]      495/4,096        126,720/531,441
[9, 3]      220/4,096        112,640/531,441
[10, 2]     66/4,096         67,584/531,441
[11, 1]     12/4,096         24,576/531,441
[12, 0]     1/4,096          4,096/531,441
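The table entries are ordinary binomial probabilities and can be checked directly; a sketch assuming SciPy, with p = 1/2 for the balanced coin and p = 2/3 for the unbalanced coin:

```python
# Check of the coin-flip table: numerators over 4,096 and over 3^12 = 531,441.
from scipy.stats import binom

for x1 in range(13):
    balanced = binom.pmf(x1, 12, 0.5) * 4096       # balanced-coin numerator
    unbalanced = binom.pmf(x1, 12, 2 / 3) * 3**12  # unbalanced-coin numerator
    print(f"[{x1:2d}, {12 - x1:2d}]  {balanced:6.0f}/4096  {unbalanced:8.0f}/531441")
```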


Canonical analysis

The above is consistent with a canonical analysis based on relative entropy, \sum_i p_i \log\left(\frac{p_i}{p_{old,i}}\right), where p_{old,i} reflects probabilities assigned based purely on the number of exchangeable ways of generating x_i; hence, p_{old,i} = \binom{n}{x_i} / \sum_i \binom{n}{x_i} = m_i / \sum_i m_i. Since the denominator of p_{old,i} is absorbed via normalization it can be dropped; then entropy reduces to \sum_i p_i \log\left(\frac{p_i}{m_i}\right) and the kernel is the same as above, k_i = m_i \exp[-λ_1 x_{1i}].

    4.6.2 Multinomial

The multinomial is the multivariate analog to the binomial accommodating k rather than two nominal outcomes. Like the binomial, the sum of the outcomes equals n, \sum_{i=1}^{k} x_i = n, where x_k = 1 for each occurrence of event k. We know there are

k^n = \sum_{x_1 + \cdots + x_k = n} \binom{n}{x_1, \ldots, x_k} = \sum_{x_1 + \cdots + x_k = n} \frac{n!}{x_1! \cdots x_k!}

possible combinations of k outcomes, x = [x_1, \ldots, x_k], in n trials. From here, probability assignment follows in analogous fashion to that for the binomial case. That is, knowledge of the expected values associated with the k events leads to the multinomial distribution when the \frac{n!}{x_1! \cdots x_k!} exchangeable ways for each occurrence x is taken into account, E[x_j] = n p_j for j = 1, \ldots, k-1. Only k-1 moment conditions are employed as the kth moment is a linear combination of the others, E[x_k] = n - \sum_{j=1}^{k-1} E[x_j] = n\left(1 - \sum_{j=1}^{k-1} p_j\right). The kernel is

\frac{n!}{x_1! \cdots x_k!} \exp[-λ_1 x_1 - \cdots - λ_{k-1} x_{k-1}]

which leads to the standard multinomial probability assignment when the moment conditions are resolved

p(x, n) = \begin{cases} \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, & \sum_{i=1}^{k} x_i = n, \ \sum_{i=1}^{k} p_i = 1 \\ 0, & \text{otherwise} \end{cases}

Example 7 (one balanced die) Suppose we roll a balanced die (k = 6) one time (n = 1); the moment conditions are E[x_1] = \cdots = E[x_6] = 1/6. Incorporating the number of exchangeable ways to generate n = 1 results, the kernel is

\frac{1!}{x_1! \cdots x_6!} \exp[-λ_1 x_1 - \cdots - λ_5 x_5]


p(x, n = 2)

[x1, x2, x3, x4, x5, x6]    balanced dice    unbalanced dice
[2, 0, 0, 0, 0, 0]          1/36             1/441
[0, 2, 0, 0, 0, 0]          1/36             4/441
[0, 0, 2, 0, 0, 0]          1/36             9/441
[0, 0, 0, 2, 0, 0]          1/36             16/441
[0, 0, 0, 0, 2, 0]          1/36             25/441
[0, 0, 0, 0, 0, 2]          1/36             36/441
[1, 1, 0, 0, 0, 0]          1/18             4/441
[1, 0, 1, 0, 0, 0]          1/18             6/441
[1, 0, 0, 1, 0, 0]          1/18             8/441
[1, 0, 0, 0, 1, 0]          1/18             10/441
[1, 0, 0, 0, 0, 1]          1/18             12/441
[0, 1, 1, 0, 0, 0]          1/18             12/441
[0, 1, 0, 1, 0, 0]          1/18             16/441
[0, 1, 0, 0, 1, 0]          1/18             20/441
[0, 1, 0, 0, 0, 1]          1/18             24/441
[0, 0, 1, 1, 0, 0]          1/18             24/441
[0, 0, 1, 0, 1, 0]          1/18             30/441
[0, 0, 1, 0, 0, 1]          1/18             36/441
[0, 0, 0, 1, 1, 0]          1/18             40/441
[0, 0, 0, 1, 0, 1]          1/18             48/441
[0, 0, 0, 0, 1, 1]          1/18             60/441
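A spot check of a few rows of the table, assuming SciPy's multinomial and, for the unbalanced dice, face probabilities i/21 (the values implied by the tabulated entries rather than stated in the excerpt above):

```python
# Spot check of the two-roll (n = 2) dice table.
from scipy.stats import multinomial

balanced = [1 / 6] * 6
unbalanced = [i / 21 for i in range(1, 7)]   # face probabilities implied by the table

for counts in ([2, 0, 0, 0, 0, 0], [1, 0, 0, 0, 1, 0], [0, 0, 0, 0, 1, 1]):
    pb = multinomial.pmf(counts, n=2, p=balanced)
    pu = multinomial.pmf(counts, n=2, p=unbalanced)
    # balanced entries are 1/36 or 1/18; unbalanced entries have denominator 441
    print(counts, round(pb * 36, 3), "/ 36", round(pu * 441, 3), "/ 441")
```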

    4.6.3 Hypergeometric

The hypergeometric probability assignment is a purely combinatoric exercise. Suppose we take n draws without replacement from a finite population of N items of which m are the target items (or events) and x is the number of target items drawn. There are \binom{m}{x} ways to draw the targets times \binom{N-m}{n-x} ways to draw nontargets out of \binom{N}{n} ways to make n draws. Hence, our combinatoric measure is

\frac{\binom{m}{x} \binom{N-m}{n-x}}{\binom{N}{n}}, \quad x \in \{\max(0, m + n - N), \ldots, \min(m, n)\}


and

\sum_{x = \max(0, m+n-N)}^{\min(m, n)} \frac{\binom{m}{x} \binom{N-m}{n-x}}{\binom{N}{n}} = 1

which completes the probability assignment as there is no scope for moment conditions or maximizing entropy.¹⁰

¹⁰ If we omit \binom{N}{n} in the denominator, it would be captured via normalization. In other words, the kernel is

k = \binom{m}{x} \binom{N-m}{n-x} \exp[0]

and the partition function is

Z = \sum_{x = \max(0, m+n-N)}^{\min(m, n)} k = \binom{N}{n}

Example 11 (hypergeometric) Suppose we have an inventory of six items (N = 6) of which one is red, two are blue, and three are yellow. We wish to determine the likelihood of drawing x = 0, 1, or 2 blue items in two draws without replacement (m = 2, n = 2).

p(x = 0, n = 2) = \frac{\binom{2}{0} \binom{6-2}{2-0}}{\binom{6}{2}} = \frac{4}{6} \cdot \frac{3}{5} = \frac{6}{15}

p(x = 1, n = 2) = \frac{\binom{2}{1} \binom{6-2}{2-1}}{\binom{6}{2}} = \frac{2}{6} \cdot \frac{4}{5} + \frac{4}{6} \cdot \frac{2}{5} = \frac{8}{15}

p(x = 2, n = 2) = \frac{\binom{2}{2} \binom{6-2}{2-2}}{\binom{6}{2}} = \frac{2}{6} \cdot \frac{1}{5} = \frac{1}{15}
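Example 11 can be verified against SciPy's hypergeometric distribution; note its argument order is (successes drawn, population size, number of target items, number of draws). A sketch:

```python
# Check of Example 11 (N = 6 items, m = 2 blue, n = 2 draws without replacement).
from scipy.stats import hypergeom

for x in range(3):
    p = hypergeom.pmf(x, 6, 2, 2)    # args: drawn targets, population, targets, draws
    print(x, round(p * 15), "/ 15")  # 6/15, 8/15, 1/15
```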

    4.7 Continuous maximum entropy examples

The partition function approach for continuous support involves density assignment

p(x) = \frac{\exp\left[-\sum_{j=1}^{m} λ_j f_j(x)\right]}{\int_a^b \exp\left[-\sum_{j=1}^{m} λ_j f_j(x)\right] dx}

where support is between a and b.

4.7.1 Continuous uniform

Example 12 Suppose we only know support is between zero and three. The above partition function density assignment is simply (there are no constraints so there are no multipliers to identify)

p(x) = \frac{\exp[0]}{\int_0^3 \exp[0]\, dx} = \frac{1}{3}

Of course, this is the density function for a uniform with support from 0 to 3.

    4.7.2 Known mean

Example 13 Continue the example above but with known mean equal to 1.35. The partition function density assignment is

p(x) = \frac{\exp[-λ_1 x]}{\int_0^3 \exp[-λ_1 x]\, dx}

and the mean constraint is

\int_0^3 x\, p(x)\, dx - 1.35 = 0

\int_0^3 x \frac{\exp[-λ_1 x]}{\int_0^3 \exp[-λ_1 x]\, dx}\, dx - 1.35 = 0

so that λ_1 = 0.2012, Z = \int_0^3 \exp[-λ_1 x]\, dx = 2.25225, and the density function is a truncated exponential distribution with support from 0 to 3.

p(x) = 0.444 \exp[-0.2012 x], \quad 0 ≤ x ≤ 3

The base (non-truncated) distribution is exponential with mean approximately equal to 5 (4.9699).

p(x) = \frac{1}{4.9699} \exp\left[-\frac{x}{4.9699}\right] = 0.2012 \exp[-0.2012 x], \quad 0 ≤ x < ∞
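A numerical check of Example 13, assuming SciPy's quad and brentq: solve the mean condition on [0, 3] for λ1 and recover Z and the normalizing constant. This is a sketch, not the text's own code.

```python
# Solve the mean constraint for the truncated exponential on [0, 3].
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def mean_gap(lam):
    Z, _ = quad(lambda x: np.exp(-lam * x), 0, 3)           # partition function
    m, _ = quad(lambda x: x * np.exp(-lam * x) / Z, 0, 3)   # implied mean
    return m - 1.35

lam1 = brentq(mean_gap, 1e-6, 5.0)
Z, _ = quad(lambda x: np.exp(-lam1 * x), 0, 3)
print(round(lam1, 4), round(Z, 5), round(1 / Z, 3))   # 0.2012, 2.25225, 0.444
```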

    4.7.3 Gaussian (normal) distribution and known variance

Example 14 Suppose we know the average dispersion or variance is σ² = 100. Then, a finite mean must exist, but even if we don't know it, we can find the maximum entropy density function for arbitrary mean, μ. Using the partition function approach above we have

p(x) = \frac{\exp\left[-λ_2 (x - μ)^2\right]}{\int_{-∞}^{∞} \exp\left[-λ_2 (x - μ)^2\right] dx}

and the average dispersion constraint is

\int_{-∞}^{∞} (x - μ)^2 p(x)\, dx - 100 = 0

\int_{-∞}^{∞} (x - μ)^2 \frac{\exp\left[-λ_2 (x - μ)^2\right]}{\int_{-∞}^{∞} \exp\left[-λ_2 (x - μ)^2\right] dx}\, dx - 100 = 0

so that λ_2 = \frac{1}{2σ^2} and the density function is

p(x) = \frac{1}{\sqrt{2π}\, σ} \exp\left[-\frac{(x - μ)^2}{2σ^2}\right] = \frac{1}{\sqrt{2π}\, 10} \exp\left[-\frac{(x - μ)^2}{200}\right]

Of course, this is a Gaussian or normal density function. Strikingly, the Gaussian distribution has the greatest entropy of any probability assignment with the same variance.
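The closing claim can be illustrated with the standard closed-form differential entropies, each evaluated at variance 100; the uniform and Laplace alternatives below are chosen here only for comparison (a sketch assuming NumPy).

```python
# Among distributions with the same variance, the Gaussian has the largest
# differential entropy; standard closed forms evaluated at variance 100.
import numpy as np

var = 100.0
h_normal = 0.5 * np.log(2 * np.pi * np.e * var)   # Gaussian
h_uniform = 0.5 * np.log(12 * var)                # uniform with variance 100
h_laplace = 1 + 0.5 * np.log(2 * var)             # Laplace with variance 100
print(h_normal, h_uniform, h_laplace)             # the Gaussian value is largest
```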

    4.7.4 Lognormal distribution

Example 15 Suppose we know the random variable has positive support (x > 0); then it is natural to work with (natural) logarithms. If we also know E[log x] = μ = 1 as well as E[(log x)²] = μ² + σ² = 11 (that is, the variance of log x is σ² = 10), then the maximum entropy probability assignment is the lognormal distribution

p(x) = \frac{1}{x σ \sqrt{2π}} \exp\left[-\frac{(\log x - μ)^2}{2σ^2}\right], \quad 0 < x < ∞, \ σ > 0

Again, we utilize the partition function approach to demonstrate.

p(x) = \frac{\exp\left[-λ_1 \log x - λ_2 (\log x)^2\right]}{\int_0^{∞} \exp\left[-λ_1 \log x - λ_2 (\log x)^2\right] dx}

and the constraints are

E[\log x] = \int_0^{∞} \log x\, p(x)\, dx - 1 = 0

and

E\left[(\log x)^2\right] = \int_0^{∞} (\log x)^2 p(x)\, dx - 11 = 0

so that λ_1 = 0.9 and λ_2 = \frac{1}{2σ^2} = 0.05. Substitution gives

p(x) ∝ \exp\left[-\frac{2(9)}{20} \log x - \frac{1}{20} (\log x)^2\right]

Completing the square and adding in the constant (from normalization) \exp\left[-\frac{1}{20}\right] gives

p(x) ∝ \exp[-\log x] \exp\left[-\frac{1}{20} + \frac{2}{20} \log x - \frac{1}{20} (\log x)^2\right]

Rewriting produces

p(x) ∝ \exp\left[\log x^{-1}\right] \exp\left[-\frac{(\log x - 1)^2}{20}\right]

which simplifies as

p(x) ∝ x^{-1} \exp\left[-\frac{(\log x - 1)^2}{20}\right]

Including the normalizing constants yields the probability assignment asserted above

p(x) = \frac{1}{x \sqrt{2π(10)}} \exp\left[-\frac{(\log x - 1)^2}{2(10)}\right], \quad 0 < x < ∞
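Because the assignment is a normal density in u = log x, the moment conditions of Example 15 are easy to verify by integrating in log space; a sketch assuming SciPy's quad.

```python
# Verify Example 15 in u = log x, where the assignment is a normal density
# with mean 1 and variance 10.
import numpy as np
from scipy.integrate import quad

mu, sig2 = 1.0, 10.0
phi = lambda u: np.exp(-(u - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)

print(quad(phi, -np.inf, np.inf)[0])                        # density integrates to one
print(quad(lambda u: u * phi(u), -np.inf, np.inf)[0])       # E[log x] = 1
print(quad(lambda u: u ** 2 * phi(u), -np.inf, np.inf)[0])  # E[(log x)^2] = 11
```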

    4.8 Maximum entropy posterior distributions

We employ maximum entropy to choose among a priori probability distributions subject to our knowledge of moments (of functions) of the distribution (e.g., the mean). When new evidence is collected, we're typically interested in how this data impacts our posterior beliefs regarding the parameter space. Of course, Bayes theorem provides guidance. For some problems, we simply combine our maximum entropy prior distribution with the likelihood function to determine the posterior distribution. Equivalently, we can find (again, by Lagrangian methods) the maximum relative entropy posterior distribution conditional first on the moment conditions, then conditional on the data, so that the data eventually outweighs the moment conditions.

When sequential or simultaneous processing of information produces the same inferences we say the constraints commute, as in standard Bayesian updating. However, when one order of the constraints reflects different information (addresses different questions) than a permutation, the constraints are noncommuting. For noncommuting constraint problems, consistent reasoning demands we modify our approach from the standard one above (Giffin and Caticha [2007]). Moment constraints along with data constraints are typically noncommuting. Therefore, we augment standard Bayesian analysis with a "canonical" factor. We illustrate the difference between sequential and simultaneous processing of moment and data conditions via a simple three state example.

    4.8.1 Sequential processing for three states

Suppose we have a three state process (θ1, θ2, θ3) where, for simplicity,

θ_i ∈ \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8\} and \sum_{i=1}^{3} θ_i = 1

We'll refer to this as a die generating process where θ1 refers to outcome one or two, θ2 to outcome three or four, and θ3 corresponds to die face five or six. The maximum entropy prior with no moment conditions is

p_{old}(θ_1, θ_2, θ_3) = \frac{1}{\sum_{j=1}^{8} j} = \frac{1}{36}

for all valid combinations of (θ1, θ2, θ3), and the likelihood function is multinomial

p(x | θ) = \frac{(n_1 + n_2 + n_3)!}{n_1!\, n_2!\, n_3!} θ_1^{n_1} θ_2^{n_2} θ_3^{n_3}

where n_i represents the number of i draws in the sample of n_1 + n_2 + n_3 total draws.

Now, suppose we know, on average, state one (θ1) is twice as likely as state three (θ3). That is, the process produces "die" with these average properties. The maximum entropy prior given this moment condition is generated from solving

\max_{p(θ)} h = -\sum_{θ} p(θ_1, θ_2, θ_3) \log p(θ_1, θ_2, θ_3)

s.t. \sum_{θ} p(θ_1, θ_2, θ_3)(θ_1 - 2θ_3) = 0

Lagrangian methods yield¹¹

p(θ) = \frac{\exp[λ(θ_1 - 2θ_3)]}{Z}

¹¹ Frequently, we write p(θ) in place of p(θ_1, θ_2, θ_3) to conserve space.


where the partition function is

Z = \sum_{θ} \exp[λ(θ_1 - 2θ_3)]

and λ is the solution to

\frac{\partial \log Z}{\partial λ} = 0

In other words,

p(θ) = \frac{\exp[1.44756(θ_1 - 2θ_3)]}{28.7313}

and

p(x, θ) = p(x | θ)\, p(θ) = \frac{(n_1 + n_2 + n_3)!}{n_1!\, n_2!\, n_3!} θ_1^{n_1} θ_2^{n_2} θ_3^{n_3}\, \frac{\exp[1.44756(θ_1 - 2θ_3)]}{28.7313}

Hence, prior to collection and evaluation of evidence, expected values of θ are

E[θ_1] = \sum_x \sum_{θ} θ_1\, p(x, θ) = \sum_{θ} θ_1\, p(θ) = 0.444554

E[θ_2] = \sum_x \sum_{θ} θ_2\, p(x, θ) = \sum_{θ} θ_2\, p(θ) = 0.333169

E[θ_3] = \sum_x \sum_{θ} θ_3\, p(x, θ) = \sum_{θ} θ_3\, p(θ) = 0.222277

where E[θ_1 + θ_2 + θ_3] = 1 and E[θ_1] = 2E[θ_3]. In other words, probability assignment matches the known moment condition for the process.

Next, we roll one "die" ten times to learn about this specific die and the outcome x is m = {n_1 = 5, n_2 = 3, n_3 = 2}. The joint probability is

p(θ, x = m) = 2520\, θ_1^5 θ_2^3 θ_3^2\, \frac{\exp[1.44756(θ_1 - 2θ_3)]}{28.7313}

the probability of the data is

p(x) = \sum_{θ} p(θ, x = m) = 0.0286946


and the posterior probability of θ is

p(θ | x = m) = 3056.65\, θ_1^5 θ_2^3 θ_3^2 \exp[1.44756(θ_1 - 2θ_3)]

Hence,

E[θ_1 | x = m] = \sum_{θ} p(θ | x = m)\, θ_1 = 0.505373

E[θ_2 | x = m] = \sum_{θ} p(θ | x = m)\, θ_2 = 0.302243

E[θ_3 | x = m] = \sum_{θ} p(θ | x = m)\, θ_3 = 0.192384

and

E[θ_1 + θ_2 + θ_3 | x = m] = 1

However, notice E[θ_1 - 2θ_3 | x = m] ≠ 0. This is because our priors including the moment condition refer to the process and here we're investigating a specific "die".
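A sketch reproducing this exercise end to end (assuming SciPy): enumerate the 36 admissible (θ1, θ2, θ3) combinations, solve the moment condition for the multiplier, then update on the observed counts.

```python
# Three-state example: maximum entropy prior with E[theta1 - 2*theta3] = 0,
# then the posterior after observing counts (n1, n2, n3) = (5, 3, 2).
import numpy as np
from itertools import product
from scipy.optimize import brentq

grid = np.array([t for t in product(np.arange(1, 9) / 10, repeat=3)
                 if abs(sum(t) - 1.0) < 1e-9])      # the 36 valid combinations
g = grid[:, 0] - 2 * grid[:, 2]                     # moment function theta1 - 2*theta3

def moment(lam):                                    # E[theta1 - 2*theta3] under p(theta)
    w = np.exp(lam * g)
    return (w / w.sum()) @ g

lam = brentq(moment, -10.0, 10.0)                   # about 1.44756
Z = np.exp(lam * g).sum()                           # about 28.7313
prior = np.exp(lam * g) / Z
print(lam, Z, grid.T @ prior)                       # prior means about (0.445, 0.333, 0.222)

counts = np.array([5, 3, 2])
like = np.prod(grid ** counts, axis=1)              # theta1^5 * theta2^3 * theta3^2
print(2520 * prior @ like)                          # p(x = m), about 0.0287
post = prior * like / (prior @ like)
print(grid.T @ post)                                # about (0.505, 0.302, 0.192)
```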


To help gauge the relative impact of priors and likelihoods on posterior beliefs, we tabulate the above results and contrast with uninformed priors as well as alternative evidence. It is important to bear in mind we subscribe to the notion that probability beliefs represent states of knowledge.

background knowledge              uninformed              E[θ1 - 2θ3] = 0
Pr(θ)                             1/36                    exp[1.44756(θ1 - 2θ3)] / 28.7313
E[θ1, θ2, θ3]                     [1/3, 1/3, 1/3]         [0.444554, 0.333169, 0.222277]
Pr(θ | n1 = 5, n2 = 3, n3 = 2)    ∝ θ1^5 θ2^3 θ3^2        3056.65 θ1^5 θ2^3 θ3^2 exp[1.44756(θ1 - 2θ3)]

4.10 Appendix: summary of maximum entropy probability assignments

distribution | known moment conditions | kernel | maximum entropy density

exponential | E[x] = θ > 0 | exp[-λx] | f(x) = (1/θ) exp[-x/θ], 0 < x < ∞

gamma | E[x] = a > 0, E[log x] = Γ′(a)/Γ(a) | exp[-λ_1 x - λ_2 log x] | f(x) = (1/Γ(a)) exp[-x] x^{a-1}, 0 < x < ∞

chi-square (ν d.f.) | E[x] = ν > 0, E[log x] = Γ′(ν/2)/Γ(ν/2) + log 2 | exp[-λ_1 x - λ_2 log x] | f(x) = \frac{1}{2^{ν/2} Γ(ν/2)} exp[-x/2] x^{ν/2-1}, 0 < x < ∞

beta | E[log x] = Γ′(a)/Γ(a) - Γ′(a+b)/Γ(a+b), E[log(1-x)] = Γ′(b)/Γ(b) - Γ′(a+b)/Γ(a+b), a, b > 0 | exp[-λ_1 log x - λ_2 log(1-x)] | f(x) = \frac{1}{B(a,b)} x^{a-1} (1-x)^{b-1}, 0 < x < 1

Pareto | E[log x] = log x_m + 1/α, α > 0 | exp[-λ log x] | f(x) = \frac{α x_m^α}{x^{α+1}}, 0 < x_m ≤ x < ∞

Laplace | E[|x|] = 1/η, η > 0 | exp[-λ|x|] | f(x) = \frac{η}{2} exp[-η|x|], -∞ < x < ∞

Wishart (n d.f., n > p - 1, Σ symmetric, positive definite) | E[tr(Σ)] = n tr(Λ), E[log|Σ|] = log|Λ| + p log 2 + \sum_{i=1}^{p} Γ′((n+1-i)/2)/Γ((n+1-i)/2) | exp[-λ_1 tr(Σ) - λ_2 log|Σ|] | f(Σ) = \frac{|Σ|^{(n-p-1)/2}}{2^{np/2} |Λ|^{n/2} Γ_p(n/2)} exp\left[-\frac{tr(Λ^{-1}Σ)}{2}\right]