Maximum Entropy Distributions


    Contents

4 Maximum entropy distributions
    4.1 Bayes theorem and consistent reasoning
    4.2 Maximum entropy distributions
    4.3 Entropy
    4.4 Discrete maximum entropy examples¹
        4.4.1 Discrete uniform
        4.4.2 Discrete nonuniform
    4.5 Normalization and partition functions
        4.5.1 Special case: Logistic shape, Bernoulli distribution
    4.6 Combinatoric maximum entropy examples
        4.6.1 Binomial
        4.6.2 Multinomial
        4.6.3 Hypergeometric
    4.7 Continuous maximum entropy examples
        4.7.1 Continuous uniform
        4.7.2 Known mean
        4.7.3 Gaussian (normal) distribution and known variance
        4.7.4 Lognormal distribution
    4.8 Maximum entropy posterior distributions
        4.8.1 Sequential processing for three states
        4.8.2 Another frame: sequential maximum relative entropy
        4.8.3 Simultaneous processing of moment and data conditions
    4.9 Convergence or divergence in beliefs
        4.9.1 Diverse beliefs
        4.9.2 Complete ignorance
        4.9.3 Convergence in beliefs
    4.10 Appendix: summary of maximum entropy probability assignments

¹ A table summarizing some maximum entropy probability assignments is found at the end of the chapter (see Park and Bera [2009] for additional maxent assignments).


4 Maximum entropy distributions

Bayesian analysis illustrates scientific reasoning where consistent reasoning (in the sense that two individuals with the same background knowledge, evidence, and perceived veracity of the evidence reach the same conclusion) is fundamental. In the next few pages we survey foundational ideas: Bayes theorem (and its product and sum rules), maximum entropy probability assignment, and consistent updating with maximum entropy probability assignments (posterior probability assignment).

    4.1 Bayes theorem and consistent reasoning

Consistency is the hallmark of scientific reasoning. When we consider probability assignment to events, whether the events are marginal, conditional, or joint, the assignments should be mutually consistent (match up with common sense). This is what Bayes product and sum rules express formally.

The product rule says the product of a conditional likelihood (or distribution) and the marginal likelihood (or distribution) of the conditioning variable equals the joint likelihood (or distribution).

p(x, y) = p(x | y) p(y) = p(y | x) p(x)

The sum rule says if we sum over all events related to one (set of) variable(s) (integrate out one variable or set of variables), we are left with the likelihood (or distribution) of the remaining (set of) variable(s).

p(x) = \sum_{i=1}^{n} p(x, y_i)

p(y) = \sum_{j=1}^{n} p(x_j, y)

Bayes theorem combines these ideas to describe consistent evaluation of evidence. That is, the posterior likelihood associated with a proposition, θ, given the evidence, y, is equal to the product of the conditional likelihood of the evidence given the proposition and the marginal or prior likelihood of the conditioning variable (the proposition), scaled by the likelihood of the evidence. Notice, we've simply rewritten the product rule where both sides are divided by p(y), and p(y) is simply the sum rule where θ is integrated out of the joint distribution, p(θ, y).

p(θ | y) = \frac{p(y | θ) p(θ)}{p(y)}

For Bayesian analyses, we often find it convenient to suppress the normalizing factor, p(y), and write the posterior distribution as proportional to the product of the sampling distribution or likelihood function and the prior distribution.

p(θ | y) ∝ p(y | θ) p(θ)

or for a particular draw y = y0

p(θ | y = y0) ∝ ℓ(θ | y = y0) p(θ)

where p(y | θ) is the sampling distribution, ℓ(θ | y = y0) is the likelihood function evaluated at y = y0, and p(θ) is the prior distribution for θ. Bayes theorem is the glue that holds consistent probability assignment together.

    Example 1 Consider the following joint distribution:

p(y = y1, θ = θ1)   p(y = y2, θ = θ1)   p(y = y1, θ = θ2)   p(y = y2, θ = θ2)
0.1                 0.4                 0.2                 0.3

    The sum rule yields the following marginal distributions:

p(y = y1)   p(y = y2)
0.3         0.7

and

p(θ = θ1)   p(θ = θ2)
0.5         0.5


    The product rule gives the conditional distributions:

      p(y | θ = θ1)   p(y | θ = θ2)
y1    0.2             0.4
y2    0.8             0.6

and

      p(θ | y = y1)   p(θ | y = y2)
θ1    1/3             4/7
θ2    2/3             3/7

    as common sense dictates.
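The bookkeeping in Example 1 is easy to mechanize. The following is a minimal sketch (assuming NumPy is available, not the text's own code) that recovers the marginal and conditional distributions from the joint table via the sum and product rules.

```python
# Joint distribution from Example 1, indexed as joint[y, theta].
import numpy as np

joint = np.array([[0.1, 0.2],    # p(y=y1, theta=theta1), p(y=y1, theta=theta2)
                  [0.4, 0.3]])   # p(y=y2, theta=theta1), p(y=y2, theta=theta2)

p_y = joint.sum(axis=1)                # sum rule: [0.3, 0.7]
p_theta = joint.sum(axis=0)            # sum rule: [0.5, 0.5]
p_y_given_theta = joint / p_theta      # product rule rearranged: p(y | theta)
p_theta_given_y = joint / p_y[:, None] # Bayes theorem: p(theta | y)
print(p_theta_given_y)                 # rows [1/3, 2/3] and [4/7, 3/7]
```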

4.2 Maximum entropy distributions

From the above, we see that evaluation of propositions given evidence is entirely determined by the sampling distribution, p(y | θ), or likelihood function, ℓ(θ | y), and the prior distribution for the proposition, p(θ). Consequently, assignment of these probabilities is a matter of some considerable import. How do we proceed? Jaynes suggests we take account of our background knowledge, ℐ, and evaluate the evidence in a manner consistent with both background knowledge and evidence. That is, the posterior likelihood (or distribution) is more aptly represented by

    p (!| y,") ! p (y| !,")p (! | ")

Now, we're looking for a mathematical statement of what we know and only what we know. For this idea to be properly grounded requires a sense of complete ignorance (even though this may never represent our state of background knowledge). For instance, if we think that 1 is more likely the mean or expected value than 2, then we must not be completely ignorant about the location of the random variable and consistency demands that our probability assignment reflect this knowledge. Further, if the order of events or outcomes is not exchangeable (if one permutation is more plausible than another), then the events are not seen as stochastically independent or identically distributed.¹ The mathematical statement of our background knowledge is defined in terms of Shannon's entropy (or sense of diffusion or uncertainty).

¹ Exchangeability is foundational for independent and identically distributed events (iid), a cornerstone of inference. However, exchangeability is often invoked in a conditional sense. That is, conditional on a set of variables, exchangeability applies; this is a foundational idea of conditional expectations or regression.


    4.3 Entropy

Shannon defines entropy as²

h = -\sum_{i=1}^{n} p_i \log p_i

for discrete events where

\sum_{i=1}^{n} p_i = 1

or differential entropy as³

h = -\int p(x) \log p(x)\, dx

for events with continuous support where

\int p(x)\, dx = 1

Shannon derived a measure of entropy so that five conditions are satisfied: (1) a measure h exists, (2) the measure is smooth, (3) the measure is monotonically increasing in uncertainty, (4) the measure is consistent in the sense that if different measures exist they lead to the same conclusions, and (5) the measure is additive. Additivity says joint entropy equals the entropy of the signals (y) plus the probability weighted average of entropy conditional on the signals.

H(x, y) = H(y) + Pr(y_1) H(x | y_1) + \cdots + Pr(y_n) H(x | y_n)

This latter term, Pr(y_1) H(x | y_1) + \cdots + Pr(y_n) H(x | y_n), is called conditional entropy. Internal logical consistency of entropy is maintained primarily via additivity, the analog to the workings of Bayes theorem for probabilities.

² The axiomatic development for this measure of entropy can be found in Jaynes [2003] or Accounting and Causal Effects: Econometric Challenges, ch. 13.

³ Jaynes [2003] argues that Shannon's differential entropy logically includes an invariance measure m(x) such that differential entropy is defined as

h = -\int p(x) \log \frac{p(x)}{m(x)}\, dx


4.4 Discrete maximum entropy examples⁴

    4.4.1 Discrete uniform

Example 2 Suppose we know only that there are three possible (exchangeable) events, x1 = 1, x2 = 2, and x3 = 3. The maximum entropy probability assignment is found by solving the Lagrangian

L ≡ \max_{p_i} \left[ -\sum_{i=1}^{3} p_i \log p_i - (λ_0 - 1)\left(\sum_{i=1}^{3} p_i - 1\right) \right]

First order conditions yield

p_i = e^{-λ_0} for i = 1, 2, 3

and

λ_0 = \log 3

Hence, as expected, the maximum entropy probability assignment is a discrete uniform distribution with p_i = 1/3 for i = 1, 2, 3.

    4.4.2 Discrete nonuniform

Example 3 Now suppose we know a little more. We know the mean is 2.5.⁵ The Lagrangian is now

L ≡ \max_{p_i} \left[ -\sum_{i=1}^{3} p_i \log p_i - (λ_0 - 1)\left(\sum_{i=1}^{3} p_i - 1\right) - λ_1 \left(\sum_{i=1}^{3} p_i x_i - 2.5\right) \right]

First order conditions yield

p_i = e^{-λ_0 - λ_1 x_i} for i = 1, 2, 3

and

λ_0 = 2.987
λ_1 = -0.834

The maximum entropy probability assignment is

p_1 = 0.116
p_2 = 0.268
p_3 = 0.616

⁴ A table summarizing some maximum entropy probability assignments is found at the end of the chapter (see Park and Bera [2009] for additional maxent assignments).

⁵ Clearly, if we knew the mean is 2 then we would assign the uniform discrete distribution above.
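For Example 3, the same probability assignment can be obtained by maximizing entropy numerically subject to the two constraints. The following is an illustrative sketch (assuming SciPy is available), not the text's own code.

```python
# Numerical check of Example 3: support {1, 2, 3}, known mean 2.5.
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0])

def neg_entropy(p):
    return np.sum(p * np.log(p))          # minimize -h = sum p log p

constraints = (
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},  # probabilities sum to one
    {"type": "eq", "fun": lambda p: p @ x - 2.5},    # known mean
)
result = minimize(neg_entropy, x0=np.full(3, 1 / 3),
                  bounds=[(1e-9, 1.0)] * 3, constraints=constraints)
print(result.x.round(3))   # approximately [0.116, 0.268, 0.616]
```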


    4.5 Normalization and partition functions

The above analysis suggests a general approach for assigning probabilities where normalization is absorbed into a denominator. Since

\exp[-λ_0] \sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_k)\right] = 1,

p(x_i) = \frac{\exp[-λ_0] \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_i)\right]}{1}

       = \frac{\exp[-λ_0] \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_i)\right]}{\exp[-λ_0] \sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_k)\right]}

       = \frac{\exp\left[-\sum_{j=1}^{m} λ_j f_j(x_i)\right]}{\sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_k)\right]}

       = \frac{k(x_i)}{Z(λ_1, \ldots, λ_m)}

where f_j(x_i) is a function of the random variable, x_i, reflecting what we know,⁷

k(x_i) = \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_i)\right]

is a kernel, and

Z(λ_1, \ldots, λ_m) = \sum_{k=1}^{n} \exp\left[-\sum_{j=1}^{m} λ_j f_j(x_k)\right]

is a normalizing factor, called a partition function.⁸ Probability assignment is completed by determining the Lagrange multipliers, λ_j, j = 1, \ldots, m, from the m constraints, which are functions of the random variables.

⁷ Since λ_0 simply ensures the probabilities sum to unity and the partition function assures this, we can define the partition function without λ_0. That is, λ_0 cancels as demonstrated above.

⁸ In physical statistical mechanics, the partition function describes the partitioning among different microstates and serves as a generator function for all manner of results regarding a process. The notation, Z, refers to the German word for sum over states, Zustandssumme. An example with relevance for our purposes is

-\frac{\partial \log Z}{\partial λ_1} = E[x],

where E[x] is the mean of the distribution. For the example below, we find

-\frac{\partial \log Z}{\partial λ_1} = \frac{3 + 2e^{λ_1} + e^{2λ_1}}{1 + e^{λ_1} + e^{2λ_1}} = 2.5

Solving gives λ_1 = -0.834, consistent with the approach below.


Return to the example above. Since we know the support and the mean, n = 3 and the function is f(x_i) = x_i. This implies

Z(λ_1) = \sum_{i=1}^{3} \exp[-λ_1 x_i]

and

p_i = \frac{k(x_i)}{Z(λ_1)} = \frac{\exp[-λ_1 x_i]}{\sum_{k=1}^{3} \exp[-λ_1 x_k]}

    where x1 = 1, x2 = 2, and x3 = 3. Now, solving the constraint

\sum_{i=1}^{3} p_i x_i - 2.5 = 0

\sum_{i=1}^{3} \frac{\exp[-λ_1 x_i]}{\sum_{k=1}^{3} \exp[-λ_1 x_k]}\, x_i - 2.5 = 0

produces the multiplier, λ_1 = -0.834, and identifies the probability assignments

p_1 = 0.116
p_2 = 0.268
p_3 = 0.616
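The same numbers fall out of the partition function route: treat λ1 as the unknown, impose the mean condition, and recover the probabilities from the kernel. A sketch assuming SciPy's brentq root finder, for illustration only:

```python
# Partition function route for the same example (support {1, 2, 3}, mean 2.5).
import numpy as np
from scipy.optimize import brentq

x = np.array([1.0, 2.0, 3.0])

def mean_gap(lam):
    k = np.exp(-lam * x)          # kernel k(x_i)
    p = k / k.sum()               # p_i = k(x_i) / Z(lambda_1)
    return p @ x - 2.5            # moment condition

lam1 = brentq(mean_gap, -5.0, 5.0)
k = np.exp(-lam1 * x)
print(round(lam1, 3), (k / k.sum()).round(3))   # -0.834 and [0.116, 0.268, 0.616]
```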

We utilize the analog to the above partition function approach next to address continuous density assignment as well as a special case involving binary (Bernoulli) probability assignment, which takes the shape of a logistic distribution.

    4.5.1 Special case: Logistic shape, Bernoulli distribution

Example 4 Suppose we know a binary (0, 1) variable has mean equal to 3/5. The maximum entropy probability assignment by the partition function approach is

p(x = 1) = \frac{\exp[-λ(1)]}{\exp[-λ(0)] + \exp[-λ(1)]} = \frac{\exp[-λ]}{1 + \exp[-λ]} = \frac{1}{1 + \exp[λ]}

This is the shape of the density for a logistic distribution (and would be a logistic density if support were unbounded rather than binary).⁹ Solving for λ = \log(2/3), or p(x = 1) = 3/5, reveals the assigned probability of success or characteristic parameter for a Bernoulli distribution.

⁹ This suggests logistic regression is a natural (as well as the most common) strategy for modeling discrete choice.
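A brief numerical restatement of this special case (assuming NumPy): the multiplier λ = log(2/3) implied by the mean constraint reproduces the success probability 3/5.

```python
# Bernoulli maximum entropy assignment with mean constraint E[x] = 3/5.
import numpy as np

lam = np.log(2 / 3)               # multiplier implied by the mean constraint
p_success = 1 / (1 + np.exp(lam)) # logistic form from the partition function
print(p_success)                  # 0.6
```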

    4.6 Combinatoric maximum entropy examples

Combinatorics, the number of exchangeable ways events occur, plays a key role in discrete (countable) probability assignment. The binomial distribution is a fundamental building block.

    4.6.1 Binomial

Suppose we know there are binary (Bernoulli or "success"/"failure") outcomes associated with each of n draws and the expected value of success equals np. The expected value of failure is redundant, hence there is only one moment condition. Then, the maximum entropy assignment includes the number of combinations which produce x1 "successes" and x0 "failures". Of course, this is the binomial operator,

\binom{n}{x_1, x_0} = \frac{n!}{x_1!\, x_0!}

where x_1 + x_0 = n. Let m_i ≡ \binom{n}{x_{1i}, x_{0i}} and generalize the entropy measure to account for m,

s = -\sum_{i=1}^{n} p_i \log\left(\frac{p_i}{m_i}\right)

Now, the kernel (drawn from the Lagrangian) is

k_i = m_i \exp[-λ_1 x_i]

Satisfying the moment condition yields the maximum entropy or binomial probability assignment.

p(x, n) = \frac{k_i}{Z} = \begin{cases} \binom{n}{x_1, x_0} p^{x_1} (1 - p)^{x_0}, & x_1 + x_0 = n \\ 0, & \text{otherwise} \end{cases}



Example 5 (balanced coin) Suppose we assign x1 = 1 for each coin flip resulting in a head, x0 = 1 for a tail, and the expected value is E[x1] = E[x0] = 6 in 12 coin flips. We know there are 2^{12} = \sum_{x_1 + x_0 = 12} \binom{12}{x_1, x_0} = 4,096 combinations of heads and tails. Solving λ_1 = 0, the maximum entropy probability assignment associated with x equal to x1 heads and x0 tails in 12 coin flips is

p(x, n = 12) = \begin{cases} \binom{12}{x_1} \left(\frac{1}{2}\right)^{x_1} \left(\frac{1}{2}\right)^{x_0}, & x_1 + x_0 = 12 \\ 0, & \text{otherwise} \end{cases}

Example 6 (unbalanced coin) Continue the coin flip example above except heads are twice as likely as tails. In other words, the expected values are E[x1] = 8 and E[x0] = 4 in 12 coin flips. Solving λ_1 = -\log 2, the maximum entropy probability assignment associated with x1 heads in 12 coin flips is

p(x, n = 12) = \begin{cases} \binom{12}{x_1} \left(\frac{2}{3}\right)^{x_1} \left(\frac{1}{3}\right)^{x_0}, & x_1 + x_0 = 12 \\ 0, & \text{otherwise} \end{cases}

p(x, n = 12)

[x1, x0]    balanced coin    unbalanced coin
[0, 12]     1/4,096          1/531,441
[1, 11]     12/4,096         24/531,441
[2, 10]     66/4,096         264/531,441
[3, 9]      220/4,096        1,760/531,441
[4, 8]      495/4,096        7,920/531,441
[5, 7]      792/4,096        25,344/531,441
[6, 6]      924/4,096        59,136/531,441
[7, 5]      792/4,096        101,376/531,441
[8, 4]      495/4,096        126,720/531,441
[9, 3]      220/4,096        112,640/531,441
[10, 2]     66/4,096         67,584/531,441
[11, 1]     12/4,096         24,576/531,441
[12, 0]     1/4,096          4,096/531,441
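The table entries are ordinary binomial probabilities and can be checked directly; a sketch assuming SciPy, with p = 1/2 for the balanced coin and p = 2/3 for the unbalanced coin:

```python
# Check of the coin-flip table: numerators over 4,096 and over 3^12 = 531,441.
from scipy.stats import binom

for x1 in range(13):
    balanced = binom.pmf(x1, 12, 0.5) * 4096       # balanced-coin numerator
    unbalanced = binom.pmf(x1, 12, 2 / 3) * 3**12  # unbalanced-coin numerator
    print(f"[{x1:2d}, {12 - x1:2d}]  {balanced:6.0f}/4096  {unbalanced:8.0f}/531441")
```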


Canonical analysis

The above is consistent with a canonical analysis based on relative entropy, \sum_i p_i \log\left(\frac{p_i}{p_{old,i}}\right), where p_{old,i} reflects probabilities assigned based purely on the number of exchangeable ways of generating x_i; hence, p_{old,i} = \binom{n}{x_i} / \sum_i \binom{n}{x_i} = m_i / \sum_i m_i. Since the denominator of p_{old,i} is absorbed via normalization it can be dropped; then entropy reduces to \sum_i p_i \log\left(\frac{p_i}{m_i}\right) and the kernel is the same as above, k_i = m_i \exp[-λ_1 x_{1i}].

    4.6.2 Multinomial

The multinomial is the multivariate analog to the binomial accommodating k rather than two nominal outcomes. Like the binomial, the sum of the outcomes equals n, \sum_{i=1}^{k} x_i = n, where x_k = 1 for each occurrence of event k. We know there are

k^n = \sum_{x_1 + \cdots + x_k = n} \binom{n}{x_1, \ldots, x_k} = \sum_{x_1 + \cdots + x_k = n} \frac{n!}{x_1! \cdots x_k!}

possible combinations of k outcomes, x = [x_1, \ldots, x_k], in n trials. From here, probability assignment follows in analogous fashion to that for the binomial case. That is, knowledge of the expected values associated with the k events leads to the multinomial distribution when the \frac{n!}{x_1! \cdots x_k!} exchangeable ways for each occurrence x is taken into account, E[x_j] = n p_j for j = 1, \ldots, k-1. Only k-1 moment conditions are employed as the kth moment is a linear combination of the others, E[x_k] = n - \sum_{j=1}^{k-1} E[x_j] = n\left(1 - \sum_{j=1}^{k-1} p_j\right). The kernel is

\frac{n!}{x_1! \cdots x_k!} \exp[-λ_1 x_1 - \cdots - λ_{k-1} x_{k-1}]

which leads to the standard multinomial probability assignment when the moment conditions are resolved

p(x, n) = \begin{cases} \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, & \sum_{i=1}^{k} x_i = n, \ \sum_{i=1}^{k} p_i = 1 \\ 0, & \text{otherwise} \end{cases}

Example 7 (one balanced die) Suppose we roll a balanced die (k = 6) one time (n = 1); the moment conditions are E[x_1] = \cdots = E[x_6] = 1/6. Incorporating the number of exchangeable ways to generate n = 1 results, the kernel is

\frac{1!}{x_1! \cdots x_6!} \exp[-λ_1 x_1 - \cdots - λ_5 x_5]


p(x, n = 2)

[x1, x2, x3, x4, x5, x6]    balanced dice    unbalanced dice
[2, 0, 0, 0, 0, 0]          1/36             1/441
[0, 2, 0, 0, 0, 0]          1/36             4/441
[0, 0, 2, 0, 0, 0]          1/36             9/441
[0, 0, 0, 2, 0, 0]          1/36             16/441
[0, 0, 0, 0, 2, 0]          1/36             25/441
[0, 0, 0, 0, 0, 2]          1/36             36/441
[1, 1, 0, 0, 0, 0]          1/18             4/441
[1, 0, 1, 0, 0, 0]          1/18             6/441
[1, 0, 0, 1, 0, 0]          1/18             8/441
[1, 0, 0, 0, 1, 0]          1/18             10/441
[1, 0, 0, 0, 0, 1]          1/18             12/441
[0, 1, 1, 0, 0, 0]          1/18             12/441
[0, 1, 0, 1, 0, 0]          1/18             16/441
[0, 1, 0, 0, 1, 0]          1/18             20/441
[0, 1, 0, 0, 0, 1]          1/18             24/441
[0, 0, 1, 1, 0, 0]          1/18             24/441
[0, 0, 1, 0, 1, 0]          1/18             30/441
[0, 0, 1, 0, 0, 1]          1/18             36/441
[0, 0, 0, 1, 1, 0]          1/18             40/441
[0, 0, 0, 1, 0, 1]          1/18             48/441
[0, 0, 0, 0, 1, 1]          1/18             60/441
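A spot check of a few rows of the table, assuming SciPy's multinomial and, for the unbalanced dice, face probabilities i/21 (the values implied by the tabulated entries rather than stated in the excerpt above):

```python
# Spot check of the two-roll (n = 2) dice table.
from scipy.stats import multinomial

balanced = [1 / 6] * 6
unbalanced = [i / 21 for i in range(1, 7)]   # face probabilities implied by the table

for counts in ([2, 0, 0, 0, 0, 0], [1, 0, 0, 0, 1, 0], [0, 0, 0, 0, 1, 1]):
    pb = multinomial.pmf(counts, n=2, p=balanced)
    pu = multinomial.pmf(counts, n=2, p=unbalanced)
    # balanced entries are 1/36 or 1/18; unbalanced entries have denominator 441
    print(counts, round(pb * 36, 3), "/ 36", round(pu * 441, 3), "/ 441")
```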

    4.6.3 Hypergeometric

The hypergeometric probability assignment is a purely combinatoric exercise. Suppose we take n draws without replacement from a finite population of N items of which m are the target items (or events) and x is the number of target items drawn. There are \binom{m}{x} ways to draw the targets times \binom{N-m}{n-x} ways to draw nontargets out of \binom{N}{n} ways to make n draws. Hence, our combinatoric measure is

\frac{\binom{m}{x} \binom{N-m}{n-x}}{\binom{N}{n}}, \quad x \in \{\max(0, m + n - N), \ldots, \min(m, n)\}


and

\sum_{x = \max(0, m+n-N)}^{\min(m, n)} \frac{\binom{m}{x} \binom{N-m}{n-x}}{\binom{N}{n}} = 1

which completes the probability assignment as there is no scope for moment conditions or maximizing entropy.¹⁰

¹⁰ If we omit \binom{N}{n} in the denominator, it would be captured via normalization. In other words, the kernel is

k = \binom{m}{x} \binom{N-m}{n-x} \exp[0]

and the partition function is

Z = \sum_{x = \max(0, m+n-N)}^{\min(m, n)} k = \binom{N}{n}

Example 11 (hypergeometric) Suppose we have an inventory of six items (N = 6) of which one is red, two are blue, and three are yellow. We wish to determine the likelihood of drawing x = 0, 1, or 2 blue items in two draws without replacement (m = 2, n = 2).

p(x = 0, n = 2) = \frac{\binom{2}{0} \binom{6-2}{2-0}}{\binom{6}{2}} = \frac{4}{6} \cdot \frac{3}{5} = \frac{6}{15}

p(x = 1, n = 2) = \frac{\binom{2}{1} \binom{6-2}{2-1}}{\binom{6}{2}} = \frac{2}{6} \cdot \frac{4}{5} + \frac{4}{6} \cdot \frac{2}{5} = \frac{8}{15}

p(x = 2, n = 2) = \frac{\binom{2}{2} \binom{6-2}{2-2}}{\binom{6}{2}} = \frac{2}{6} \cdot \frac{1}{5} = \frac{1}{15}
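Example 11 can be verified against SciPy's hypergeometric distribution; note its argument order is (successes drawn, population size, number of target items, number of draws). A sketch:

```python
# Check of Example 11 (N = 6 items, m = 2 blue, n = 2 draws without replacement).
from scipy.stats import hypergeom

for x in range(3):
    p = hypergeom.pmf(x, 6, 2, 2)    # args: drawn targets, population, targets, draws
    print(x, round(p * 15), "/ 15")  # 6/15, 8/15, 1/15
```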

    4.7 Continuous maximum entropy examples

The partition function approach for continuous support involves density assignment

p(x) = \frac{\exp\left[-\sum_{j=1}^{m} λ_j f_j(x)\right]}{\int_a^b \exp\left[-\sum_{j=1}^{m} λ_j f_j(x)\right] dx}

where support is between a and b.

4.7.1 Continuous uniform

Example 12 Suppose we only know support is between zero and three. The above partition function density assignment is simply (there are no constraints so there are no multipliers to identify)

p(x) = \frac{\exp[0]}{\int_0^3 \exp[0]\, dx} = \frac{1}{3}

Of course, this is the density function for a uniform with support from 0 to 3.

    4.7.2 Known mean

Example 13 Continue the example above but with known mean equal to 1.35. The partition function density assignment is

p(x) = \frac{\exp[-λ_1 x]}{\int_0^3 \exp[-λ_1 x]\, dx}

and the mean constraint is

\int_0^3 x\, p(x)\, dx - 1.35 = 0

\int_0^3 x \frac{\exp[-λ_1 x]}{\int_0^3 \exp[-λ_1 x]\, dx}\, dx - 1.35 = 0

so that λ_1 = 0.2012, Z = \int_0^3 \exp[-λ_1 x]\, dx = 2.25225, and the density function is a truncated exponential distribution with support from 0 to 3.

p(x) = 0.444 \exp[-0.2012 x], \quad 0 ≤ x ≤ 3

The base (non-truncated) distribution is exponential with mean approximately equal to 5 (4.9699).

p(x) = \frac{1}{4.9699} \exp\left[-\frac{x}{4.9699}\right] = 0.2012 \exp[-0.2012 x], \quad 0 ≤ x < ∞
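A numerical check of Example 13, assuming SciPy's quad and brentq: solve the mean condition on [0, 3] for λ1 and recover Z and the normalizing constant. This is a sketch, not the text's own code.

```python
# Solve the mean constraint for the truncated exponential on [0, 3].
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def mean_gap(lam):
    Z, _ = quad(lambda x: np.exp(-lam * x), 0, 3)           # partition function
    m, _ = quad(lambda x: x * np.exp(-lam * x) / Z, 0, 3)   # implied mean
    return m - 1.35

lam1 = brentq(mean_gap, 1e-6, 5.0)
Z, _ = quad(lambda x: np.exp(-lam1 * x), 0, 3)
print(round(lam1, 4), round(Z, 5), round(1 / Z, 3))   # 0.2012, 2.25225, 0.444
```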

    4.7.3 Gaussian (normal) distribution and known variance

Example 14 Suppose we know the average dispersion or variance is σ² = 100. Then, a finite mean must exist, but even if we don't know it, we can find the maximum entropy density function for arbitrary mean, μ. Using the partition function approach above we have

p(x) = \frac{\exp\left[-λ_2 (x - μ)^2\right]}{\int_{-∞}^{∞} \exp\left[-λ_2 (x - μ)^2\right] dx}

and the average dispersion constraint is

\int_{-∞}^{∞} (x - μ)^2 p(x)\, dx - 100 = 0

\int_{-∞}^{∞} (x - μ)^2 \frac{\exp\left[-λ_2 (x - μ)^2\right]}{\int_{-∞}^{∞} \exp\left[-λ_2 (x - μ)^2\right] dx}\, dx - 100 = 0

so that λ_2 = \frac{1}{2σ^2} and the density function is

p(x) = \frac{1}{\sqrt{2π}\, σ} \exp\left[-\frac{(x - μ)^2}{2σ^2}\right] = \frac{1}{\sqrt{2π}\, 10} \exp\left[-\frac{(x - μ)^2}{200}\right]

Of course, this is a Gaussian or normal density function. Strikingly, the Gaussian distribution has the greatest entropy of any probability assignment with the same variance.
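The closing claim can be illustrated with the standard closed-form differential entropies, each evaluated at variance 100; the uniform and Laplace alternatives below are chosen here only for comparison (a sketch assuming NumPy).

```python
# Among distributions with the same variance, the Gaussian has the largest
# differential entropy; standard closed forms evaluated at variance 100.
import numpy as np

var = 100.0
h_normal = 0.5 * np.log(2 * np.pi * np.e * var)   # Gaussian
h_uniform = 0.5 * np.log(12 * var)                # uniform with variance 100
h_laplace = 1 + 0.5 * np.log(2 * var)             # Laplace with variance 100
print(h_normal, h_uniform, h_laplace)             # the Gaussian value is largest
```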

    4.7.4 Lognormal distribution

Example 15 Suppose we know the random variable has positive support (x > 0); then it is natural to work with (natural) logarithms. If we also know E[log x] = μ = 1 as well as E[(log x)²] = μ² + σ² = 11 (that is, the variance of log x is σ² = 10), then the maximum entropy probability assignment is the lognormal distribution

p(x) = \frac{1}{x σ \sqrt{2π}} \exp\left[-\frac{(\log x - μ)^2}{2σ^2}\right], \quad 0 < x < ∞, \ σ > 0

Again, we utilize the partition function approach to demonstrate.

p(x) = \frac{\exp\left[-λ_1 \log x - λ_2 (\log x)^2\right]}{\int_0^{∞} \exp\left[-λ_1 \log x - λ_2 (\log x)^2\right] dx}

and the constraints are

E[\log x] = \int_0^{∞} \log x\, p(x)\, dx - 1 = 0

and

E\left[(\log x)^2\right] = \int_0^{∞} (\log x)^2 p(x)\, dx - 11 = 0

so that λ_1 = 0.9 and λ_2 = \frac{1}{2σ^2} = 0.05. Substitution gives

p(x) ∝ \exp\left[-\frac{2(9)}{20} \log x - \frac{1}{20} (\log x)^2\right]

Completing the square and adding in the constant (from normalization) \exp\left[-\frac{1}{20}\right] gives

p(x) ∝ \exp[-\log x] \exp\left[-\frac{1}{20} + \frac{2}{20} \log x - \frac{1}{20} (\log x)^2\right]

Rewriting produces

p(x) ∝ \exp\left[\log x^{-1}\right] \exp\left[-\frac{(\log x - 1)^2}{20}\right]

which simplifies as

p(x) ∝ x^{-1} \exp\left[-\frac{(\log x - 1)^2}{20}\right]

Including the normalizing constants yields the probability assignment asserted above

p(x) = \frac{1}{x \sqrt{2π(10)}} \exp\left[-\frac{(\log x - 1)^2}{2(10)}\right], \quad 0 < x < ∞
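Because the assignment is a normal density in u = log x, the moment conditions of Example 15 are easy to verify by integrating in log space; a sketch assuming SciPy's quad.

```python
# Verify Example 15 in u = log x, where the assignment is a normal density
# with mean 1 and variance 10.
import numpy as np
from scipy.integrate import quad

mu, sig2 = 1.0, 10.0
phi = lambda u: np.exp(-(u - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)

print(quad(phi, -np.inf, np.inf)[0])                        # density integrates to one
print(quad(lambda u: u * phi(u), -np.inf, np.inf)[0])       # E[log x] = 1
print(quad(lambda u: u ** 2 * phi(u), -np.inf, np.inf)[0])  # E[(log x)^2] = 11
```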

    4.8 Maximum entropy posterior distributions

We employ maximum entropy to choose among a priori probability distributions subject to our knowledge of moments (of functions) of the distribution (e.g., the mean). When new evidence is collected, we're typically interested in how this data impacts our posterior beliefs regarding the parameter space. Of course, Bayes theorem provides guidance. For some problems, we simply combine our maximum entropy prior distribution with the likelihood function to determine the posterior distribution. Equivalently, we can find (again, by Lagrangian methods) the maximum relative entropy posterior distribution conditional first on the moment conditions, then conditional on the data, so that the data eventually outweighs the moment conditions.

When sequential or simultaneous processing of information produces the same inferences we say the constraints commute, as in standard Bayesian updating. However, when one order of the constraints reflects different information (addresses different questions) than a permutation, the constraints are noncommuting. For noncommuting constraint problems, consistent reasoning demands we modify our approach from the standard one above (Giffin and Caticha [2007]). Moment constraints along with data constraints are typically noncommuting. Therefore, we augment standard Bayesian analysis with a "canonical" factor. We illustrate the difference between sequential and simultaneous processing of moment and data conditions via a simple three state example.

    4.8.1 Sequential processing for three states

Suppose we have a three state process (θ1, θ2, θ3) where, for simplicity,

θ_i ∈ \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8\} and \sum_{i=1}^{3} θ_i = 1

We'll refer to this as a die generating process where θ1 refers to outcome one or two, θ2 to outcome three or four, and θ3 corresponds to die face five or six. The maximum entropy prior with no moment conditions is

p_{old}(θ_1, θ_2, θ_3) = \frac{1}{\sum_{j=1}^{8} j} = \frac{1}{36}

for all valid combinations of (θ1, θ2, θ3), and the likelihood function is multinomial

p(x | θ) = \frac{(n_1 + n_2 + n_3)!}{n_1!\, n_2!\, n_3!} θ_1^{n_1} θ_2^{n_2} θ_3^{n_3}

where n_i represents the number of i draws in the sample of n_1 + n_2 + n_3 total draws.

Now, suppose we know, on average, state one (θ1) is twice as likely as state three (θ3). That is, the process produces "die" with these average properties. The maximum entropy prior given this moment condition is generated from solving

\max_{p(θ)} h = -\sum_{θ} p(θ_1, θ_2, θ_3) \log p(θ_1, θ_2, θ_3)

s.t. \sum_{θ} p(θ_1, θ_2, θ_3)(θ_1 - 2θ_3) = 0

Lagrangian methods yield¹¹

p(θ) = \frac{\exp[λ(θ_1 - 2θ_3)]}{Z}

¹¹ Frequently, we write p(θ) in place of p(θ_1, θ_2, θ_3) to conserve space.


where the partition function is

Z = \sum_{θ} \exp[λ(θ_1 - 2θ_3)]

and λ is the solution to

\frac{\partial \log Z}{\partial λ} = 0

In other words,

p(θ) = \frac{\exp[1.44756(θ_1 - 2θ_3)]}{28.7313}

and

p(x, θ) = p(x | θ)\, p(θ) = \frac{(n_1 + n_2 + n_3)!}{n_1!\, n_2!\, n_3!} θ_1^{n_1} θ_2^{n_2} θ_3^{n_3}\, \frac{\exp[1.44756(θ_1 - 2θ_3)]}{28.7313}

Hence, prior to collection and evaluation of evidence, expected values of θ are

E[θ_1] = \sum_x \sum_{θ} θ_1\, p(x, θ) = \sum_{θ} θ_1\, p(θ) = 0.444554

E[θ_2] = \sum_x \sum_{θ} θ_2\, p(x, θ) = \sum_{θ} θ_2\, p(θ) = 0.333169

E[θ_3] = \sum_x \sum_{θ} θ_3\, p(x, θ) = \sum_{θ} θ_3\, p(θ) = 0.222277

where E[θ_1 + θ_2 + θ_3] = 1 and E[θ_1] = 2E[θ_3]. In other words, probability assignment matches the known moment condition for the process.

Next, we roll one "die" ten times to learn about this specific die and the outcome x is m = {n_1 = 5, n_2 = 3, n_3 = 2}. The joint probability is

p(θ, x = m) = 2520\, θ_1^5 θ_2^3 θ_3^2\, \frac{\exp[1.44756(θ_1 - 2θ_3)]}{28.7313}

the probability of the data is

p(x) = \sum_{θ} p(θ, x = m) = 0.0286946


and the posterior probability of θ is

p(θ | x = m) = 3056.65\, θ_1^5 θ_2^3 θ_3^2 \exp[1.44756(θ_1 - 2θ_3)]

Hence,

E[θ_1 | x = m] = \sum_{θ} p(θ | x = m)\, θ_1 = 0.505373

E[θ_2 | x = m] = \sum_{θ} p(θ | x = m)\, θ_2 = 0.302243

E[θ_3 | x = m] = \sum_{θ} p(θ | x = m)\, θ_3 = 0.192384

and

E[θ_1 + θ_2 + θ_3 | x = m] = 1

However, notice E[θ_1 - 2θ_3 | x = m] ≠ 0. This is because our priors including the moment condition refer to the process and here we're investigating a specific "die".
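A sketch reproducing this exercise end to end (assuming SciPy): enumerate the 36 admissible (θ1, θ2, θ3) combinations, solve the moment condition for the multiplier, then update on the observed counts.

```python
# Three-state example: maximum entropy prior with E[theta1 - 2*theta3] = 0,
# then the posterior after observing counts (n1, n2, n3) = (5, 3, 2).
import numpy as np
from itertools import product
from scipy.optimize import brentq

grid = np.array([t for t in product(np.arange(1, 9) / 10, repeat=3)
                 if abs(sum(t) - 1.0) < 1e-9])      # the 36 valid combinations
g = grid[:, 0] - 2 * grid[:, 2]                     # moment function theta1 - 2*theta3

def moment(lam):                                    # E[theta1 - 2*theta3] under p(theta)
    w = np.exp(lam * g)
    return (w / w.sum()) @ g

lam = brentq(moment, -10.0, 10.0)                   # about 1.44756
Z = np.exp(lam * g).sum()                           # about 28.7313
prior = np.exp(lam * g) / Z
print(lam, Z, grid.T @ prior)                       # prior means about (0.445, 0.333, 0.222)

counts = np.array([5, 3, 2])
like = np.prod(grid ** counts, axis=1)              # theta1^5 * theta2^3 * theta3^2
print(2520 * prior @ like)                          # p(x = m), about 0.0287
post = prior * like / (prior @ like)
print(grid.T @ post)                                # about (0.505, 0.302, 0.192)
```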


To help gauge the relative impact of priors and likelihoods on posterior beliefs, we tabulate the above results and contrast with uninformed priors as well as alternative evidence. It is important to bear in mind we subscribe to the notion that probability beliefs represent states of knowledge.

background knowledge              uninformed              E[θ1 - 2θ3] = 0
Pr(θ)                             1/36                    exp[1.44756(θ1 - 2θ3)] / 28.7313
E[θ1, θ2, θ3]                     [1/3, 1/3, 1/3]         [0.444554, 0.333169, 0.222277]
Pr(θ | n1 = 5, n2 = 3, n3 = 2)    ∝ θ1^5 θ2^3 θ3^2        3056.65 θ1^5 θ2^3 θ3^2 exp[1.44756(θ1 - 2θ3)]

4.10 Appendix: summary of maximum entropy probability assignments

distribution | known moment conditions | kernel | maximum entropy density

exponential | E[x] = θ > 0 | exp[-λx] | f(x) = (1/θ) exp[-x/θ], 0 < x < ∞

gamma | E[x] = a > 0, E[log x] = Γ′(a)/Γ(a) | exp[-λ_1 x - λ_2 log x] | f(x) = (1/Γ(a)) exp[-x] x^{a-1}, 0 < x < ∞

chi-square (ν d.f.) | E[x] = ν > 0, E[log x] = Γ′(ν/2)/Γ(ν/2) + log 2 | exp[-λ_1 x - λ_2 log x] | f(x) = \frac{1}{2^{ν/2} Γ(ν/2)} exp[-x/2] x^{ν/2-1}, 0 < x < ∞

beta | E[log x] = Γ′(a)/Γ(a) - Γ′(a+b)/Γ(a+b), E[log(1-x)] = Γ′(b)/Γ(b) - Γ′(a+b)/Γ(a+b), a, b > 0 | exp[-λ_1 log x - λ_2 log(1-x)] | f(x) = \frac{1}{B(a,b)} x^{a-1} (1-x)^{b-1}, 0 < x < 1

Pareto | E[log x] = log x_m + 1/α, α > 0 | exp[-λ log x] | f(x) = \frac{α x_m^α}{x^{α+1}}, 0 < x_m ≤ x < ∞

Laplace | E[|x|] = 1/η, η > 0 | exp[-λ|x|] | f(x) = \frac{η}{2} exp[-η|x|], -∞ < x < ∞

Wishart (n d.f., n > p - 1, Σ symmetric, positive definite) | E[tr(Σ)] = n tr(Λ), E[log|Σ|] = log|Λ| + p log 2 + \sum_{i=1}^{p} Γ′((n+1-i)/2)/Γ((n+1-i)/2) | exp[-λ_1 tr(Σ) - λ_2 log|Σ|] | f(Σ) = \frac{|Σ|^{(n-p-1)/2}}{2^{np/2} |Λ|^{n/2} Γ_p(n/2)} exp\left[-\frac{tr(Λ^{-1}Σ)}{2}\right]