26
, . J | ELSEVIER Journal of Statistical Planning and Inference 49 (1996) 137-162 joumal of statistical " and infere~nrung Entropy, divergence and distance measures with econometric applications Aman Ullah Department of Economics, University of California, Riverside, CA 92521, USA Received 22 February 1993; revised 12 September 1993 Abstract This paper provides a unified treatment of various entropy, divergence and distance measures and explores their applications in the context of econometric estimation and hypothesis testing. AMS Subject Classifications: 62H30, 94A17, 53A35 Keywords: Divergence measures; Entropy; Distance; Hypothesis testing; Esimation 1. Introduction The importance of suitable measures of distance between probability distributions arises because of the role they play in the problems of inference and discrimination. The concept of distance between two probability distributions was initially developed by Mahalanobis (1936). Since then various types of distance measures have been de- veloped in the literature, see Burbea and Rao (1982) and Rao (1982). Many of the currently used econometric tests, such as the likelihood ratio, the score and Wald tests, can in fact be shown to be in terms of appropriate distance measures. A concept closely related to the one of distance measures is that of divergence measures based on the idea of information-theoretic entropy first introduced in communication theory by Shannon (1948) and later by Wiener (1949) in Cybernetics. The origin of the term entropy, however, goes back to the work of Clausius (1864) and Boltzman (1872) in thermodynamics. Although it is well-known that the two concepts are related, here ;:: In completing this paper, the author is thankful to two referees, M. Behara, D.V. Gokhale, G. Judge, S.N.U.A. Kirmani, J.N. Kapur, E. Maasoumi, C.R. Rao, A. Zellner and V. Zinde-Walsh for providing useful references and constructive comments. The financial support from the Academic Senate, UCR, is gratefully acknowledged. 0378-3758/96/$15.00 (~) 1996~Elsevier Science B.V. All rights reserved SSDI 0378-3758(95)00034-8

Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

, . J

| ELSEVIER

Journal of Statistical Planning and Inference 49 (1996) 137-162

joumal of statistical " and infere~ nrung

Entropy, divergence and distance measures with econometric applications

A m a n Ul l ah Department of Economics, University of California, Riverside, CA 92521, USA

Received 22 February 1993; revised 12 September 1993

Abstract

This paper provides a unified treatment of various entropy, divergence and distance measures and explores their applications in the context of econometric estimation and hypothesis testing.

AMS Subject Classifications: 62H30, 94A17, 53A35

Keywords: Divergence measures; Entropy; Distance; Hypothesis testing; Esimation

1. Introduction

The importance of suitable measures of distance between probability distributions arises because of the role they play in the problems of inference and discrimination.

The concept of distance between two probability distributions was initially developed

by Mahalanobis (1936). Since then various types of distance measures have been de- veloped in the literature, see Burbea and Rao (1982) and Rao (1982). Many of the currently used econometric tests, such as the likelihood ratio, the score and Wald tests, can in fact be shown to be in terms of appropriate distance measures. A concept closely related to the one of distance measures is that of divergence measures based on the idea of information-theoretic entropy first introduced in communication theory by Shannon (1948) and later by Wiener (1949) in Cybernetics. The origin of the term entropy, however, goes back to the work of Clausius (1864) and Boltzman (1872) in thermodynamics. Although it is well-known that the two concepts are related, here

;:: In completing this paper, the author is thankful to two referees, M. Behara, D.V. Gokhale, G. Judge, S.N.U.A. Kirmani, J.N. Kapur, E. Maasoumi, C.R. Rao, A. Zellner and V. Zinde-Walsh for providing useful references and constructive comments. The financial support from the Academic Senate, UCR, is gratefully acknowledged.

0378-3758/96/$15.00 (~) 1996~Elsevier Science B.V. All rights reserved SSDI 0 3 7 8 - 3 7 5 8 ( 9 5 ) 0 0 0 3 4 - 8

Page 2: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

138 A. Ullah/ Journal of Statistical Plannin9 and Inference 49 (1996) 137-162

we consider Shannon's concept of information-theoretic entropy and its generalization known as the Kullback and Leibler (1951 ) relative entropy or the divergence measure between two probability distributions. Jaynes (1957) introduced the maximum entropy (Maxent) principle which determines the probability distribution of a random variable/ vector by maximizing the Shannon entropy, subject to certain moment conditions. His optimization principle is the same as the Kullback principle of minimizing the Kullback-Leibler relative entropy when one of the distributions is uniform. Jaynes' Maxent principle was a turning point in the use of Shannon's entropy as a method of statistical inference, and since then an increasing number of applications of this en- tropy have been made in various applied sciences, for example see the recent books by Kapur (1989), Kapur and Kesavan (1992), Buck and Macauley (1991), and Cover and Thomas (1991). In econometrics the early contributions were made by Davis (1941), Theil and his students, Tintner (1960), and recently by Zellner and his students, among others. While the seminal work of Theil (1967) used the entropy function to measure income inequality and industrial concentration, the work of Zellner (1991) provided an excellent synthesis of Bayesian and entropy concepts in model building and inference. However, perhaps it is fair to say that up to now no systematic development of entropy based econometric inference (theory and applications) has been done.

In recent years, various extensions of the Shannon entropy and the Kullback-Leibler divergence measure have been done, see e.g. Burbea and Rao (1982) and Kapur and Kesavan (1992). At present however most of these extensions are purely mathematical concepts to define measures of distance between two probability distributions, and not much is known about their interrelationships, their inferential properties and their ap- plications in statistics and econometrics. There are three main objectives of this paper. First is essentially to provide a unified review of the developments in distance, di- vergence and entropy measures in the continuous variables case. This is by no means claimed to be complete. Second, we observe that the current literature on entropy has focused mainly on Shannon entropy and Kullback-Leibler relative entropy (diver- gence), perhaps because of their simplicity. In view of this we explore the motivations and implications of using various generalized classes of entropy measures reviewed in this paper. It has been indicated that the use of different entropy measures may lead to different models or statistical results than those obtained by Shannon and Kullback- Leibler measures. This has been demonstrated throughout the paper by considering Havrda and Charvat (1967) r-class of entropy measures. The main thrust of the paper is centred around this point and it indicates the need of developing selection rules for entropy/divergence measures in any given problem, also see the important work by Csisz~r (1991). The third objective is to analyse the application of the generalized class of entropy and divergence measures, especially r-class entropies, in the con- texts of econometric estimation and hypothesis testing. This generalizes the results for Shannon and Kullbak-Leibler measures. We note here that Sections 2 and 3 of this paper are the extended and improved version of Ullah (1983) where the initial thoughts on entropy and divergence measures were presented. The present paper also gives more general results with regard to the applications of r-class entropies in econometrics.

Page 3: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. Ullah/Journal of Stat&tical Plann&o and Inference 49 (1996) 137-162 139

We hope that the above objectives will lead to future work in entropy based econo-

metrics, and at least provide alternative procedures for econometrics inference. In ap-

plications, the progress will depend on the implementation of the Maxent principle for the multivariate cases where significant advances have been made by kernel density based procedures, see Hardle (1990).

The plan of the paper is as follows. In Section 2 we provide the definitions and concepts related to distance and the general class of entropy measures. Some statistical results based on the /~-class entropy are also analysed. Then, in Section 3, we present various classes of directed and non-directional divergence measures and indicate those non-directional divergence measures which are also distance measures. Some econo-

metric results based on the/~-class divergence measures are also analysed. In Section 4 we present the entropy based optimizing principles and explore their applications in

the estimation and hypothesis testing problems. Finally, in Section 5 we review other applications and present our conclusion.

2. Distance and entropy measures

In this section we present some definitions and concepts related to distance and

entropy. These will be useful for developing the distance measures between probability distributions (populations) in Section 3.

2.1. D&tance

A measure d ( x , y ) between two points x , y is said to be a distance measure or distance if

(i) d ( x , y ) > 0 when x ~ y and d ( x , y ) - - 0 if and only i f x = y, (ii) d ( x , y ) = d (y , x ) , (2.1)

(iii) d(x, y ) + d(x ,z )>~d(y ,z ) .

The conditions (i)-(iii) imply, respectively, that the distance must be non-negative, symmetric and that the distance from point x to point y directly must be less than or equal to the distance in reaching point y indirectly through z. Note that distance d(x, y ) is also called metric.

The definition of metric space follows from the definition of distance or metric, i.e. a space X is said to be a metric space if for every pair of points x, y in X there is defined a distance d ( x , y ) satisfying (i)-(iii) in (2.1).

Let us consider two non-stochastic column vectors x = ( x 1 . . . . . Xn) ! and

Y : (Yl . . . . . Yn)'. Then the Minkowsk i distance or Lp norm between x and y is given by

Lp = d(p)(X, y ) = (xi - y i) p p >~ 1

--[[x - yH p. (2.2)

Page 4: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

140 A. UUah/Journal of Statistical Planning and Inference 49 (1996) 137-162

For p = 2 it reduces to

d(z)(x ,y) = xi - yi) 2

o r

= IIx - y l l 2

n

d~2)(x ' y ) = Z ( x i _ yi)2 =- (x - y f ( x - y ) i= [

= xlx + y ' y - 2x ' y

(2.3)

which is the Euclidian distance. An extension is the Euclidian or Hilbert distance in

an infinite dimension seperable linear (vector) space. The Hilbert distance is the same

as (2.3) where n is replaced by ec. This will be denoted as

d(2)(x, y ) = dn(x, y ) = [I x -- Yll 2. (2.4)

When p tends to infinity, do~(x,y) = maxl<~j~n(Xj - yj) . For p < 1, (2.2) does

not satisfy (2.1) and hence it will not be a distance measure.

We point out that in the finite or infinite dimensional vector space d(p)(X, y ) satisfy

the property of ' translation invariance' in the sense that d(p)(X, y ) = d(p)(x + b, y + b)

for any real constant b. The Euclidian distance also satisfies 'homogenei ty ' , i.e, i f x

and y = ax are two points in the vector space with a a real constant, then the distance

of point y from the origin 0 is d ( O , y ) = [a[d(O,x).

When the vectors y and x are stochastic, dp(x, y ) or Lp norm in (2.2) will be defined

a s

d(p)(X, y ) = IIx - yll p = (E(x - y)p)l/p. (2.5)

For p = 2, d(2)(x,y) = dH(x ,y ) = (E(x - y)2)1/2 = [[x - y[12 will be the Hilbert

Euclidean distance corresponding to (2.3).

In the following sections we will most ly be dealing with the cases where, for n = 1,

Xl = f l ( Y ) and Yl = f2 (Y) are the density functions with the domain o f y belong-

ing to N1. For these situations the d ( p ) ( f l ( y ) , f 2 ( y ) ) or the Lp norm is considered

a s

d ( p ) ( f l , f 2 ) = I I71 - f211 p = ( f l ( y ) - f z ( y ) ) p d y ) . (2.6)

For p = 1, we have Kolmogrov ' s distance.

2.2. General class o f entropy measures

Let us consider a random vector y = [ Y l . . . . . Yn]' with the multivariate probabil i ty

density function f ( y ) such that f y f ( y ) d y = 1, where the integral is taken over the

Page 5: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. Ullah/Journal of Statistical Plannin 0 and Inference 49 (1996) 137-162 141

entire space o f y. Then a measure o f information content from observations in f ( y ) is l o g ( f (y) ) -1 = - l o g f ( y ) and the expected information in y is given by

H(y) = H ( f ) = - E l o g f ( y ) = - f y l o g f ( y ) f ( y ) d y . (2.7)

This definition is due to Shannon (1948), and it is a measure of average information and

uncertainty or volatility. For the case where y is, say, a discrete scalar random variable,

H ( f ) = -Y '~p(yk) log P(Yk) where k = 1 . . . . . r and P(Yk) ---- P(Y = Yk). Note that when the event y --- Yk occurs with complete certainty, P(Yk) = 1 and - l o g P(Yk) = 0 which implies a lack of information. Similarly, when the event y = yk occurs with

P(Yk) tending to zero, - l o g P(Yk) tends to +oc which implies a high degree o f

information. It is in this sense that - l o g p, or in the continuous case, - l o g f , can

be considered as a measure of information content in a random vector y. The use

of notation H ( f ) emphasizes that the entropy depends on the probability density o f

y rather than its actual values. We note here that although entropy is extensively

developed in the discrete case, the concept goes through for the continuous case by

changing the summation to the Riemann integral observing the following points. First,

the entropy in the continuous case is defined if the density function of y is such that

the integral in (2.7) exists. Second, while in the discrete case the entropy is always

positive it may become negative for some continuous density functions. For example,

if n = 1 and f ( y ) = lib is a uniform density between 0 < y < b, then H ( f ) = log b which can be negative for b < 1. In this case Shannon's entropy H ( f ) in (2.7) cannot

represent a measure o f uncertainty since uncertainty is usually positive. However, the

difference between uncertainties of two probability distributions can be negative and so

the Kuilback-Leibler extension of (2.7) in Section 3 to a measure of relative entropy

or uncertainty is useful.

We note that Shannon's definition o f information (entropy) in communication theory

is the same as Wiener 's information (1948, p. 75) in his works on cybernetics although

their motivation was different. The origin o f information theory may however be dated

back to 1928 due to Hartley. Shannon's entropy is also mathematically equivalent to

the entropy equation in thermo dynamics, see Cover and Thomas (1991) and Kapur

and Kesavan (1992).

As noted above, Shannon's entropy is the expected value of the function 9 ( f ) = - l o g f , which satisfies 9(1) = 0 and 9(0) = oc. In general, one can choose any

convex function 9 ( f ) with the condition that O(1) = 0 as a measure o f information

content (Khinchin 1957). The expected information content is then given as

Ha(f) = E g ( f ) = fyg(f)f, dy, (2.8)

and we refer to this as a class of g-entropies. Axiom (restriction) systems have been

used in the literature to choose the information functions g. Some of these are presented below.

Page 6: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

142 A. Ullah/Journal of Statistical Plannino and Inference 49 (1996) 137 162

Let us consider a class of smooth functions 9 given by

{ (/7 - l ) - l (1 - fs~-l), / 7 7 1 , / 7 > 0 ,

gl3( f )= - l o g f , /7= 1, (2.9)

where /7 is a non-stochastic constant. Then we can write a class of entropy measures a s

H~(f ) = ( /7 - 1)- l [ 1 - E f ~ - l ] , / 75 1,/7 > 0, (2.10)

- E log f = H ( f ) , /7 = 1.

This will be called the fl-class entropy and it is given by Havrda and Charvat (1967). For fl = 1 we get Shannon's entropy as given in (2.7).

Another class of entropy measures are given by Renyi (1961) and it is

{ ( ~ 1) -I log :<-I - f y f f d y

H ~ ( f ) = ( ~ - l ) - i l o g E f x-1 ~ l , c ~ > 0 (2.11)

- E log f -- H ( f ) . ~ = 1.

Note that //-class entropy measures of Havrda and Charvat, a-class measures of Renyi, and Shannon's entropy measure differ with respect to the axioms on which they are based. While Shannon and Havrda and Charvat entropy measures satisfy the axioms (i) additive, (ii) continuous and symmetric, (iii) zero indifference, H ( f , 0 ) = H ( f ) , and (iv) branching, the entropy measure of Renyi satisfies only (i)-(iii). We note also that Havrda and Charvat entropy measures are monotonic functions of Renyi measures. For further details regarding axioms and properties of these measures see Havrda and Charvat (1967).

2.2.1. Conditional entropy Let n = 2 so that y = [yl,y2], f ( Y ) is the joint density and f ( Y i ) = f l and

f (Y2) = f2 are the marginal densities. Then we can write Shannon's entropy as

= - i f f (Yl )f(Y2IY, )[log f(Yi )+ log f(Y2IY' )1 dyl dy2

= H ( f i ) + H(f21fi) 2

= Z H ( f , l f i - I . . . . . f l ) (2.12) i --I

Page 7: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. UllahlJournal of Statistical Plannin9 and Inference 49 (1996) 137 162 143

where H(fl) is the entropy of Yl and

H(f2lfl)=-fyf(Y')[fy2f(YzlY')l°gf(Y2lyl)dy2 ]

= f f(yl )H(y2IYl )dy l

=Ey, H(f2]fl) (2.13)

is the conditional entropy of the random variable y2 given the random varaible yl, and it is the average of the entropies of the conditional distribution over all the values

of the conditioning variables. Note that

H(f l l f2) ¢ H(f2lf l ),

H(f l ) - H(f l Ifz ) = H(f 2) - H(f 2 If ~ ), (2.14)

H(f2lf~) - H(fz) <. O,

where the last inequality suggests that knowing about another random variable Yl can

reduce the average uncertainty about y2, H(f2[fl)~<H(f2), with equality holding if and only if Yl and Y2 are independent. Thus one can consider H(f2lfl ) - H(f2) as a measure of dependence, see also 3.1.2. When y = [Yl . . . . . Yn] we get

n

H(f) = Z H ( f g I f i - l ..... f l ) i=1

n

<~ Z H ( f ,) (2.15)

where equality holds if and only if yi . . . . . yn are independent. The above properties for the conditional entropies are obtained for the Shannon

entropy. They may not necessarily go through or have similar forms for the fl- and a-classes of entropies in (2.10) and (2.11 ). For example, in the case of the fl-entropy

in (2.10) we can verify that

HtJ(f)=(fl--1)-l [1-- f f J(Yl,y2)dyldy2]

=( f l -1) - ' [1- f f ~ ( y , ) [ f f~(y2lYl)dy2] dyl

= H#(fl ) + HB(f21fl) (2.16)

where

Ha(f2 If1 ) = f fa(y, )H/~(f2 I f l ) dyl

= Ef, [f~l-lHa(f2lfl )] (2.17)

is the conditional entropy.

Page 8: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

144 A. Ullah/Journal of Statistical Planning and Inference 49 (1996) 137 162

Let us now compare these results with those of Shannon's (fl = 1) entropy in (2.12) and (2.13). It is clear that the additivity o f the Shannon type in (2.12) goes through for the fl-class entropy in (2.16). However the conditional entropy, in contrast to

Shannon's case in (2.13), is now the weighted average of the entropy of the condi-

tional distributions of Y2 given Yl. The weight is f / ~ - l ( y l ) . Further, if Yl and Y2 are

independent then, unlike in Shannon's case where H ( f l , f 2 ) = H ( f l ) + H(f2) , we

get Hli( f l, f2 ) = Hl~(f l ) + Hl~(f 2 ) f fl~(Yl ).

2.2.2. Evaluation o f fl-class entropies In order to see the flavour o f the value of the fl-class entropy, let us consider the

case where f ( y ) is a multivariable normal density with mean vector p and covariance

matrix 2;. Then it can be verified that (2.10) is

1 H[~(f) = fl - 1

_ 1

fl 1 1

fl-1

- - [l- /fl~(y)dy]

1 (2~p~/2l,y,l~/2 _ - - 1 f e_(l~/2)(y_;,),z-,(y_U)dy ]

- - - [1 - (2n) -(n{/~- l )/2)IZI-(;~-~)/2un/2] (2.18)

which is the average information contained in the normal random vector y. When n = 1 and S = a 2 1

H l i ( f ) _ fl _1 1 [1 - - (27~)-([:~-l)/2tT-(fl-1)fl-l/2] (2.19)

We note that the value of H~(f) changes with the choice o f the entropy measure.

For example, in the case of Shannon's entropy (fl = 1) we obtain

H~(f) = H ( f ) = log a + ½ + log x / ~ . (2.20)

As another example, let n = 1 and f ( y ) = O-le -(y/O), y > 0, 0 > 0, be the expo-

nential density. Then it can be verified that the average information in the exponential

variable is

1 m

(2.21)

and it varies with the choice o f entropy. For the Shannon entropy, fl = 1 and H i ( f ) = H ( f ) = 1 + log 0. To see the value of fl for which Hl~(f ) is large, consider 0 = 1 so

that Hl~(f) = 1/ft. It is then clear that for fl~> 1 the maximum information about f ( y ) is given by the Shannon entropy. However, if we allow 0 < fl < 1, then the Shannon entropy has the least information about f ( y ) . This implies that given a density function we can have an entropy measure which may have more information about it than the information given by the Shannon entropy.

Page 9: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. Ullah/Journal of Statistical Plannino and lnJ~'rence 49 (1996) 13~162 145

3. Divergence measures between probability distributions

In Section 2.1 we discussed distance measures between two vectors x and y. Now

suppose there is an n × 1 random vector y with continuous probability density

f l ( y ) = f l and fz (Y) = f2 corresponding to two situations or hypotheses. For exam- ple, f l and f2 may represent the income distributions of two groups or regions, may

represent prior and posterior probabilities or represent two different economic models.

For such cases we consider the dissimilarity between two groups or models on the

basis of the distance or divergence between two densities f l and f2. These are given

below. The notations f l and f2 should not be confused with those in Section 2.2.1

where they represent densities of two different random variables. We also note here

that a distance measure can always be considered as a divergence measure, however a

divergence measure may not satisfy all the conditions of a distance given in (2.1) and

thus may not be a distance measure. Such divergence measures will be referred to as

pseudo-distance measures.

3.1. General class of directed divergence measures

A divergence measure can be derived in terms of the ratio 2 = f 2 / f l such that the

difference in the distributions is large when 2 is far from 1. By taking the ratio we are

in fact looking for the divergence of f2 from f l (the ratio f l / f 2 will likewise indicate

the divergence of f l from f2) . However, when 2 = f 2 / f l , the expected divergence

with respect to f l is fyf12 dy = 1, and thus it is a useless measure of divergence. An alternative measure o f divergence can however be developed in terms of the

information (entropy) content in 2, say by a convex function 9(2) such that g ( 1 ) = 0

(see (2.8)). The mean (expected) information content in f2 with respect to f l or

divergence of f2 with respect to f l is then

Hg(f l , f2) ---~ , f l y ~ dy. (3.1)

This divergence measure can be considered as an extension of the entropy functional

in (2.8) and thus we will call it the generalized entropy or relative entropy Junctional, see Csiszfir (1972).

From Jensen's inequality we can show that (3.1) is

/>0 (3.2)

because g ( l ) = 0. Thus the equality holds iff f l = f2. Notice that Hy(fl,f2) H~/( f2 , f l ) = f~.f2g(fl/f2) dy in general, i.e. H~(fl,f2) is not symmetric. Also the triangular inequality may not hold for various choices of g(fj, f2). Therefore Hy() in (3.1) is not the distance in the proper sense as it does not satisfy all the axioms of a distance given in (2.1). The Hy(f l , f2) is actually a directed divergence or pseudo- distance measure.

Page 10: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

146 A. Ullah / Journal of Statistical Plannin O and Inference 49 (1996) 137-162

For the choices of y corresponding to (2.10) and (2.11) we can write the /3- and

a-classes of divergence measures as

H ~ ( f l , f 2 ) = (fl-- 1) -1 f l [ - 1 dy, fl ~ 1 (3.3)

and

l log / ( f ~ ) ~ - ' -- - - f l dy, H ~ ( f l , f 2 ) -- ~ - - 1 c~ ¢ 1, ~ > 0 (3.4)

When fl ~ 1 in (3.3) and ~ ~ 1 in (3.4) we get

j H ( f l , f 2 ) = - f , log ~ dy = El, log ~

which is the Kullback-Leibler (1951) generalization of Shannon's entropy in (2.7), and also known as the relative entropy or divergence measure of f2 from f l . When

f2 is a uniform density, (3.5) becomes the Shannon entropy. It is easy to see that the directed divergence measure in (3.5) is neither symmetric nor does it satisfy the

triangular inequality. Hence it is a pseudo-distance measure.

3.1.1. Generalized cross entropy While in Section 2.1 we introduced the y-class entropy of f l , and in Section 3.1

we gave the y-class generalized entropy of f l , f2 (directed divergence measures), here we present a general class of cross entropy as

H i ( f l , f2 ) = f f l Y(f2 ) dy. (3.6)

Kannapan (1974) suggested Y(f2) = ( f l - l ) - l [ f ~ - / ~ - 1], fl -¢ 1. When fl ~ 1,

9(f2) ~ - l o g f2 and we get

H * ( f l , f 2 ) = - f f l log f2 dy. (3.7)

With this definition, the Kullback-Leibler relative entropy in (3.5) can be written as

H ( f l, f 2 ) = H * ( f l, f 2 ) - H ( f l ). (3.8)

3.1.2. A measure of dependence or mutual information Let n = 2 so that y = [Yl, Y2]- Further let f ( y ) be the joint density and f ( Y l ) = f l

and f (Y2) = f2 be the densities o f Yl and Yz, respectively, as in Section 2.2.1. Then a measure of dependence between yl and y2 can be written from Section 2.2.1 as

D#(f , f , f 2 ) = H # ( f ) - ( H ( f l ) + (f f~ dyl ) H ( f 2 ) ) • (3.9)

The measure of dependence in (3.9) can be described as the measure of mutual infor- mation in the sense that it measures the amount of information Y2 contains about yl .

Page 11: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. Ullah/ Journal of Statistical Plannin,q and Inference 49 (1996) 13~162 147

We can verify that, for the Kullback-Leibler divergence (fl = 1), the dependence measure reduces to

D l ( f , f l f 2) = H ( f ) - ( H ( f l ) + H ( f 2 ) )

= E f ( l o g f - - f l ~ 2 ) = H ( f , f , f 2 ) (3.10)

Further, if Yl and y2 are independent then DI~ = DI = 0 . Thus, in practice, one can consider the test for D/~ = 0 as a test for independence between two random variables.

3.2. General classes of non-directional diveroence measures

In Section 3.1 we presented directed divergence measures which are not symmetric, and some of them may not satisfy the triangular inequality. Here we consider the di- vergence measures which are symmetric and these are referred to as the non-directional divergence measures. Some of these may also satisfy the triangular inequality and hence

qualify as distance measures. A class of non-directional divergence measures, which avoid the asymmetry problem,

can be developed in the following ways:

l y ( f l, f 2) =- Hg( f l, f2 ) + Ho(f 2, f l ) (3.11 )

J°( f " f 2 ) = 2H° ( f ' + f 2 ) - H°( f ' ) - H°( f 2 (3.12)

Ko(f l, f2 ) = f y ( f l - f2 )[9(f2 ) - 9(f l )] dy, (3.13)

see Burbea and Rao (1982) for details on their mathematical properties. For the choices

of 9 and H , see (2.10), (2.11), (3.3) and (3.4)-(3.5). Now we turn to some well-known distance measures which are special cases of the

non-directional measures given above. For example for 9( f ) = - log f , it can be seen that (3.11) and (3.13) reduce to

f I, l(fl,f2) = K ( f l , f 2 ) = ( f l - f z ) l o g ~ dy (3.14)

which is the familiar Jeffreys-Kullback-Leibler divergence. This measure does not satisfy the triangular inequality of distance and hence it is a pseudo-distance measure.

Further, it is easy to verify that for fl = 1/2 in (3.3) we can write (3.11) as

l , / 2 ( f l , f 2 ) = 4 [ 1 - f ( f , f a ) ' / 2 d y ]

"~.~ / t l / 2 tl /2"~

Page 12: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

148 A. Ullah/Journal of Statistical Plannino and Inference 49 (1996) 13~162

where d(2)( ) is the L2-norm distance between J 1 fl/2 and fU22 (see (2.6) for p = 2) and it is a proper distance measure satisfying the triangular inequality. Note that d ~,~1/2 f~/2) is also the well-known Matusita (1955, 1967) distance-measure and (2)k J 1 ,

it can be written as

d ~ fl/2 (2)~,J1 ,f~/2) = 2(1 - p) (3.15)

1/2 1/2 where 0~<p = f f l f2 ~<1 is the monotonic function of Bhattacharya's (1943) 'dis- tance' measure between f l and f2 . Actually Bhattacharya's distance is B = - log p so that p = e -B. Another related distance by Bhattacharya is cos -1 p.

Note that (3.14) can serve as a measure of association or affinity between f l and f2 . I f d(2)( ) is close to 2 we have no association and if d(2) is close to zero we have a strong association. In this sense d(2) is somewhat analogous to the Durbin-Watson statistic used in econometrics for measuring serial correlation in data.

A generalized version of the Matusita distance d(2) is the 6-order Hellinger distance given by

d tt6/2 f~/2) - 2 (a/2 6/2 HI j 1 , = ]¢5] d(z)(f l , f 2 )" (3.16)

For ~ = 1, the Hellinger distance is the Matusita distance.

Another distance measure, which is a special of Kg(fl , f2) in (3.13) with 9 given by (2.9) (fl = 2 ) , can be obtained as

K2( f l , f2 ) = d(2)(f l , f 2 )

= f ( f l - f2)2 dy

= [ f f2 + f f 2 ] ( 1 _ p*) (3.17)

where d(2)(fl,f2) is the L2-norm distance between f l and f2 , and where

2 f f l f 2 O~p* = f f2 + f f]~<l (3.18)

is another affinity measure but its relationship with p in (3.15) is not clear. One can extend the L2-norm distance in (3.17) to the Lp-norm distance

d(p) = ( f [ f 2 - f i r ) l/p. For p = 1 we get Kolmogrov ' s distance as d(1)( f l , f2)= f [f2 - f l l dy. Some statistical properties of the Matusita distance and its relationship to Kolmogrov ' s distance are discussed in Kirmani (1971).

3.3. Geodesic distance measure on parameter space

Let the parametric family of the density of y be f ( y ) = f(y,O) where 0 is a k x 1 parameter vector. Then the distance between probability density functions can be defined as the geodesic distance between their parameter values (Rao, 1945; Burbea and Rao, 1982). Such a geodesic distance is calculated as the distance provided by the differential metric of a Riemannian geometry for parameter 0

ds~(0) = dO'14~(O)dO = -dZH4,(f)/df 2 (3.19)

Page 13: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. Ullah/Journal of Statistical Planning and Inference 49 (1996) 137 162 149

where H4)(f ) = f 9 ( f ) f dy = f c~(f) dy and

f d/'( ~ c3f Of I4~(0) = - yT "J ) -~ - -0 ; dy. (3.20)

I~o(0) is called the qS-entropy matrix and ds~(0) is the q~-entropy distance. For Shannon's entropy, ~b(f) = - f log f and

I(O) = E [0log f 0log f ] ~ --~-07~ j (3.21)

which is Fisher's information matrix. In some cases the distance on the parameter space described above may coincide with

some well-known distances. For example, for a multivariate normal density f ( y , O) = N(/2, Z) with /2 fixed and q~(f) = - f l o g f the distance (3.19) is 2-1~n(log 2i) 2 where 2i is the solution of I~1 - ~-~21 = o, (see Atkinson and Mitchell (1981)). Further, when X is fixed, the geodesic distance is (/21 -/21)22-1(/21 -/22) which is the Mahalanobis distance. For non-normal examples, see Rao (1985).

3.4. Entropy and divergence measures and their relationships

It is clear from above that there are several types of entropies and divergence mea- sures. Not much is known regarding the optimal selection rules between these measures except that they are based on various axioms. Further, although their mathematical properties are well established, their inferential properties and applications are rela- tively unknown. In the literature, usually the Shannon entropy and Kullback-Leibler divergence measures have been chosen on ad hoc grounds, perhaps because of their simplicity. It is possible that in a given econometric problem the use of different dis- tance measures may give different results. This is similar to the result in Section 2.2.1 where it was indicated that the Shannon entropy may perform differently compared to the fl < 1 class of entropies. The following examples further emphasize these points.

3.4.1. fl-class divergence measures for normal populations and reoressions Let us consider two n x 1 multivariate normal populations with densities f l (Y ) =

N(/2t,S1) and f z (Y) = N ( / 2 2 , ~ ' 2 ) . Then it can easily be verified that the directed Kullback-Leibler divergence from (3.5) is

l D :,I H ( f l , f 2 ) = ½ (trXlZ~-' - n) - ~ log ~ + ½(#1 - #2 )'2221(/2, -/22). (3.22)

Further the non-directional Jeffrey-Kullback-Leibler divergence from (3.14) is

l ( f l , f2) = g ( f l , f e ) + H(f2, f l )

= l t r [ z ~ l Z ~ 2 1 -~- ' ~ 2 ~ 1 ' ] -1- ½(/21 - - / 22 ) t ( z~2 1 -I- z~11)( /21 - - /22) - - n .

(3.23)

Page 14: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

150 A. Ullah/Journal of Statistical Planning and Inference 49 (1996) 137 162

When X1 = Z'2 = 2`, then l ( f l , f2 ) = (121-122 )t~--1(121 __ 1-12 ) which is the well-known Mahalanobis distance, and also the geodesic distance in Section 3.3.

In a special case of regression model y = Zy + u where u ,-~ N(0,a21) and Z is

a non-stochastic matrix, the distance measure l ( f l , f2) may be of interest to discrim- inate between two normal populations or hypotheses H0: f l O ' ) = N(0, a 21) against

Hi: f2 (Y) = N(122 = ZT, a21). This amounts to discriminating H 0 : 7 = 0 against H0: ~ ~ 0. For this case H ( f l , f 2 ) reduces to ( Z 1 = z~ 2 = 321)

1 y'Z'Z~ H ( f l , f 2 ) - 2 a ~ ' (3.24)

7' Z' Z~ I ( f i , f 2 ) = 2 H ( f l , f 2 ) - 0-2 (3.25)

Further, in another special case where f l ( y ) = N(121,0-12I) and f 2 ( Y ) = N(122, 0-2I) we get -

E' 'J l ( f l , f 2 ) = ~ [a~ + a ~ j +½(121-122) '(12z-122) ~ + ~ - n

n [a l 2 0"2] - -n ; 121 = 1 2 2 = 0 . (3.26) = +

Now we consider the fl-class divergence measures given in (3.3) and see what it

looks like for f l ( y ) = N(121,S1) and f 2 ( Y ) = N(122 , -Y '2 ) - This can be obtained as

1 [ [2`2 I(fl- t/2) 12`1 lfl/2e -'2~ ] (3.27) H l ~ ( f l ' f 2 ) - fl - I ~ / 2 - - 1 J

where

/ ~-*--1 / ,r-*-- 1 ! *--1 --I *--1 = 122Z'2 122 "t- 121Z, 1 121 - - (1211 2`~ - 1 -t- 122z~ 2 )14 (Z~1121 -t- '~2 122)

* - - 1 * *.~_ A = S 1 - 1 + 2 ` 2 , S 1 =f l -12`1, 2" 2 - ( f l - 1 ) - l Z 2 . (3.28)

Further we can obtain lt~(fl, f 2 ) = H#(fl , f2 ) + H#(f2 , f l ). In the special case where 121 = 0, 122 = Z? and Z1 = Z2 = 0-21, we get IA] = (32) -n

and ¢ = - f l ( f l - 1)(122 -121 )'(122 -121 )/°-2 -~ - ( f l - 1)fl(TtZtZ7)/0-2. Thus we get

1 [e -(1/2)~ - 1] H B ( f l , f 2 ) - - f l - 1

_ 1_ [1 - e/~(/~-l)H(f''f:)] (3.29)

where H ( f ~ , f 2 ) is the Kullback-Leibler divergence given in (3.24). In another special case where 121 = 122 = 0, 2`1 = 0-21 and 2`2 = 0-21 we get

1 [(o2)"(e - w z 1 ] Hl~(f l ' f2) -- 1 - fl [ (0-2),/~/2 (0-21( f l _ 1) - 0-2fl),,/2 - 1 . (3.30)

From the above results the following points can be noted. First, from (3.29) it is clear that H ~ ( f l , f 2 ) is a monotonic function of the Kullback-Leibler divergence measure.

Page 15: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. UllahlJournal of Statistical Plannin9 and Inference 49 (1996) 137-162 151

However the numerical value of H#( f l , f2) for various fl can be quite different from

H ( f l , f 2). For example, from (3.29), H#(f l , f 2) ~-- f lH( f l , f 2) and thus H~/H ~-- fl which is < 1 or > 1 depending on the choice of an entropy, i.e. the value of ft. These

results will have implications when HB(fl, f2) and H( f l , f2) are used for hypothesis testing, see Section 4.

Now we turn to the L2-norm distance d(2)(fl , f2) in (3.15). Again when f (Y l ) = N ( / . t 1 = O, a21) and f(Y2) = N ( , u 2 = Z?,a 21) it can be seen that

d(2)(fl, f 2 ) = 2( 4r~a2)-n/2 [1 - e -(l/2)l(f' f2)] (3.31 )

where l ( f l , f 2 ) is the Jeffrey-Kullback-Leibler distance in (3.5). It is clear that

d(2)(fl , f2) is also a monotonic function of l ( f l , f 2 ) = 2H(fl, . f2). Further

d{2)(fl , f2) ~ (4ga2)-n/21(ft,f2) and hence d(2)(fl , f2) < I ( f t , f 2 ) so long as (4gO "2) > l.

3.4.2. Gaussian stationary processes In a recent paper Zinde-Walsh (1992) considered the case of f l ( y ) = N(0, 271 ) and

f 2 ( y ) = N(0,S2) as in Section 3.4.1 but 2~1 and 2;2 are the covariance matrices of stationary ARMA(p,q) processes. She has shown, among other results, that for the

AR(I ) process H ( f l , f 2 ) < d H but for the MA(1) process H ( f b f 2 ) > d H where d(m is the Hilbert L2-norm distance in (2.4). The result for H ( f l , f 2 ) can be obtained

from (3.22). Further, the results can also be obtained for the HB(f l , f2 ) in (3.27) and

d(2)(fl , fz). This will be a subject of future study. The above results suggest the need for further research in exploring the relationships

and selection rules for the divergence and entropy measures. In a recent paper Csiszfir

(1991) considers the problem of choosing between the Euclidean (L2-norm) distance and the Kullback-Leibler type distance measures in the context of the linear inverse

problem of solving for y in y = ZT+u where y, Z and u are given but the matrix Z can be of less than full rank. The desirable properties of the distance measures are given in a number of axioms; regularity, locality, symmetry, scale invariance, translation

invariance, transitivity. It has been shown that while for some situations, especially for

the singluar cases, the divergence measure Hg(f l , f2 ) in (3.l) is consistent with the selection axioms, for some other situations the Euclidean distance satisfies most of the

axioms.

4. Optimization principles and econometric inference

In Sections 2 and 3 we considered general classes of the entropy and divergence measures which were dependent on unknown density functions. For the density of a specified form we also explored their relationships and the inequalities between them. In practice, however, the true nature of the density f ( y ) is rarely, if ever, known.

Thus, these divergence measures can be useful in practice if we can determine the unknown f ( y ) by using the data on y. We present below the calculation of f by

Page 16: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

152 A. Ullah/Journal of Statistical Planning and Inference 49 (1996) 137 162

the Maxent or minimum divergence (relative entropy) principle and then indicate its applications in econometrics estimation and hypothesis testing.

Let us rewrite the fl-class divergence measures between f ( y ) and fo(Y) from (3.3) a s

H~(f, f o ) - fl - 1 [ \ f o J - 1 (4.1)

where fo(Y) is the prior density of y. Then according to the minimum divergence

principle we obtain f which minimizes H#(f, fo) subject to

f f ( y )dy = 1, Ehr(y) = = 1 (4.2) ar, F m

where hr(y) are chosen from y~, (log y)r, lY- EYl ~, etc. This can be implemented by forming the lagrangian equation in a standard way, and the density obtained is referred

to as f/~(y). The Kullback minimum relative entropy principle minimizes (4.1) for

= 1, i.e. H(f, fo) = f f l o g ( f / fo)dy. In this case the solution is

f ( y ) = f 0 ( y ) e x p [ - 2 0 - 21hl(y) . . . . . )~mhm(y)], (4.3)

where 2~ are obtained so as to satisfy all the constraints. We note that since the min-

imization of f f l og ( f i f o ) is the maximization o f - f f l o g ( f i fo)dy , the minimum relative entropy principle is the same as the Jaynes (1957, 1979) Maxent principle for

uniform f0. In practice we need to replace ~r by )~r which are obtained by replacing ar with the

method of moment estimator ar = Zh~(yi)/n or by the maximum likelihood estimator.

This gives

fcl (Y) = f(Y) = f o(Y)exp[-,~0 - , ~ l ( Y ) . . . . . ~mhm(y)]. (4.4)

There are many numerical algorithms for implementing (4.4), see Agmon et al. (1979)

and Zellner and Highfield (1989). The Maxent principle has been extensively used in the literature to characterize

the distributions given the moment conditions. For example, the normal distribution is the Maxent distribution if the first two moments are specified. For a detailed list of the moment conditions and the corresponding Maxent distributions, see Kapur and Kesavan (1992, p. 359), and for the multivariate case see Ahmed and Gokhale (1989). An important issue in practice is the decision about the parameter m, the number of moment conditions. Also, the form of h needs to be chosen. The choices of h and m

may affect the form of f (y) . In the Bayesian context the Maxent principle has been developed by Zellner (1977)

to obtain the prior distributions of the parameters. Such prior distributions have been referred as maximal data information priors (MDIP). These are obtained by maximizing the average divergence

171 = f H(f(y[O), fo(O))dO (4.5) do

Page 17: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. Ullah/Journal of Statistical Planning and Inference 49 (1996) 137-162 153

where f ( y]O) is the data density and fo(O) is the prior density of 0. The MDIP density is then given by

fo(O) = c e x p ( - H ( 0 ) ) (4.6)

where H(O) = - f f ( y l O ) log f ( y l O ) is the average entropy (information) in the data density f (y[O) , and c is a normalizing constant.

An important point is that in both Bayesian and non-Bayesian literature, only Shannon's and Kullback's optimizing principles have been used extensively. This is perhaps because of their simplicity in both theory and application. However there is no reason why one should not explore the other entropy and divergence measures.

The reasons for this are as follows. First, because of the advancement in computing it should be easy to evaluate the Maxent density ) ~ ( y ) based on (4.1). Second, various other measures of divergence described in Sections 2 and 3 satisfy more conditions of a distance (metric) than the Kullback-Leibler divergence measure. Third, the optimiza- tion principles based on the Shannon-Kullback entropies need complicated moment

conditions to determine the distribution. In some situations it may be useful to have a complicated entropy measure but simpler or fewer moment conditions. Fourth, it has been noted in Sections 2 and 3 and later that different divergence measures may have different implications for a given estimation and hypothesis testing problem. Finally, an important reason is that different entropy and divergence measures may lead to different models or distributions. For example, Kapur and Kesavran (1992, p. 331) indicate that

given E y 2 = a for lyl < c, the Maxent density J~2(Y) (fl = 2) based on (4.1) may be quite different from the one obtained by using Shannon's entropy. Similarly, it is

known that while the Maxent principle based on Shannon will lead to log-linear mod- els, the use of fl-class entropy in (4.1) will lead to a log-log. Thus if the log-log model

is the true model then the model based on the Shannon entropy will be a misspecified model.

As a final point we observe that the Maxent density )Cl~(y ) can be considered as a data based semiparametric or non-parametric density estimator. It of course depends on

the choice of entropy fl and the number of moment conditions m. In this respect it is analogous to the non-parametric kernel density estimator which depends on the kernel function and window width. However, while in kernel density estimation the choice of the kernel function does not matter asymptotically and there is now a vast literature

on the selection of window width, the literature on the selection of entropy and rn is quite thin.

4.1. Applications in econometric inference

Here we look into the potential applications of entropy and divergence measures and their optimization principles in the contexts of econometric estimation and hypothesis testing. Both parametric and non-parametric models will be considered.

Page 18: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

154 A. Ullah/Journal of Statistical Planning and Inference 49 (1996) 137-162

4.1.1. Parametric estimation and hypothesis testin 9 Let f2 = f2(Y,O) be the parametric density of y; further consider fl(y) as the

given prior density. Then, from (3.3), the /7-class divergence measures is

'EI( )1 Hl~(fl,f2) = H/~(fl,f2(y,O))- ]7_ 1 f i \~22J - 1 . (4.7)

When /7 = 1 we have the Kullback-Leibler divergence

H(f l , f 2 )= i f t l og f~l d Y = S f l l o g f i d y - i f l l o g f 2 ( y , O ) d y f2(Y, O)

= S log f l d F l - f log f2(y,O)dFi.

A consistent estimator of H(fl,f2) is then

1 n l ~ r - - ~ n

/ O ( f l , f 2 ) = n~--'~ log f~(Yi)- ,-~---" log f2(y,O), (4.8) 1 1

where the first term on the right hand side does not depend on 0. The minimization of H ( f l , f2 ) is then equivalent to the maximization of the log likelihood, ~ log f2(Yi, 0). This shows that the minimization of the /7-class entropy for/7 = 1 provides estimators which are equivalent to the maximum likelihood estimators.

When /7 = 2,

H2(fl'f2)----i(k, f2(y~O) 1) dy ( f~(Yi) (4.9)

where the second approximation is the discrete analogue of H2(fl,f2), and it is the well-known Neyman-Pearson g 2 distance. Thus, for fl = 2, the minimization of the fl-divergence is equivalent to the minimum X 2 estimator.

W e also note that

H(fl, f2 ) = f Y l log ¢i f2(y, 0~--) dy ( f l - f2(Y, O) =ff, log\ f~,(y~-,~ +l) dy

~ i f , \(fl-- f2(y'O))f2(f,O) dy=H2(fi,f2). (4.10)

Thus the minimization of H(fl,f2) is equivalent to the likelihood and minimum )(2 estimators. Further, these estimators are also equivalent to the Pearson method of moments estimator of 0 provided we consider the moment conditions characterizing the distribution of y.

Page 19: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. Ullah / Journal of Statistical Planning and Inference 49 (1996) 137-162 155

The above comparisons suggest that the traditional maximum likelihood estimation techniques are directly related to the entropy optimization principles. This is also true in the contexts of Bayesian and non-Bayesian hypothesis testing and model selection, see Akaike (1973) and Zellner and Min (1992). In the context of testing H0: y = u against Hi: y = Zy+u, for example, we observe from Section 3.4.1 t h a t / t # ( f l , f z ) is the monotonic function of ~Z'Z~/62 which is the Wald test statistic; i2 and 62 are the least squares estimators. But the Wald statistic, in the present case, is also a monotonic function of F and the likelihood ratio test. Thus the exact test based on / 4 # ( f l , f 2 ) will be identical with those based on the Wald, the F, the score test and the likelihood ratio test. This is not surprising since f l / f 2 is analogous to the likelihood ratio, and from Bayes's theorem f l / f 2 equals the posterior odd ratio multiplied by the ratio of prior densities.

The asymptotic tests based on/4#(.fl , )c2) may however give conflicting results since, in general, 2n/~t~(j'l,f2 ) may also be asymptotically Z 2 distributed. This conjecture is based on the Kupperman (1958) result for fl = 1. The asymptotic test problem described here is similar to the problem known in the literature for the Wald, score and likelihood tests. More work is needed on the properties of H#( j ' l , f 2 ) ; Zografos et al. (1989) is the starting point. We note that the testing for heteroskedasticity and/or

serial correlation can also be developed by using Hl~(fl , f2) in (3.27).

4.1.2. Maxent estimation of conditional moments Let us consider two random variables yl, y2, and ~O(yl) as a real valued function

such that El~O(yl)] < oo. Then the conditional moment of ~9(yl ) given Y2 is

E(~k(yl )ly2) = M~(y2) -- re(y, ) f (Y l lY2) dyl, (4.11)

and this function Mq,(y2) is of considerable interest in econometrics. For example, when ~k(yl) = Yl, Mq,(y2) is the regression function or the first conditional moment of yl; when qJ(Yl) = (yl - E ( y l ly2)) 2 we have the heteroskedasticity function; and in general where if(Y1) = ( y l - E(ylIy2)) r we have the rth conditional moments of Yl. In non-parametric literature the estimation of M~,(y2) has been carried out by substituting the kernel density estimation of f ( y l ]y2), and the asymptotic properties of these estimators are well established in the statistics literature, see Singh and Tracy (1977), Ullah (1988) and Hardle (1990) for references. The non-parametric kernel estimates of )Q~(y2) are then used to estimate the response or regression coefficients by calculating the derivative of ~7/q,(y2), the asymptotic properties of this are also well

established. One obvious way to develop the Maxent estimation Mq,(y2) will be to follow the

above procedure and replace f ( y l [Yz) by its Maxent estimator j~#(Yl lY2) obtained by maximizing the fl-entropy based conditional entropy in Section 2.2.1 subject to certain conditina moment restrictions. For a special case of ~(yl ) = yl and fl = 1 (Shannon's entropy) this procedure has been carefully developed in Ryu (1991), among others.

Page 20: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

156 A. Ullah l Journal of Statistical Planning and Inference 49 (1996) 137-162

Detailed asymptotic properties of such estimators and their relationships with the non- parametric estimators would be an interesting subject for future research.

The use of Maxent density may also be developed in the semiparametric setting in the same way that the non-parametric kemel estimation has been used in Robinson (1988) and Pagan and Ullah (1990), among others. For example, in the linear model yi = Y23 )+u where V(uly 2) is unknown, a two step generalized least squares estimator of 7 can be obtained by first using the Maxent estimator of V(uly2) suggested above. Further, in the situations of estimating 7 by the maximum likelihood when the density of f ( u ) is unknown a Maxent estimator of f ( u ) can be used to obtain the well-known two-step or iterative maximum likelihood estimator.

In the purely parametric setting Ryu (1991 ) and Zellner and Min (1992) have charac- terized the Maxent conditional distributions f (Y l [Y2) by using the conditional moment conditions. This has in turn characterized the conditional regression models. For exam- ple, if the first two conditional moments of yl are Y2~ and ~r 2, respectively, then the Maxent model is the linear model Yl = Y2~ + u. Ryu (1991) also provides the moment functions which would characterize orthonormal regression, Gallant's (1981) Fourier flexible form, and other models.

4.1.3. Specification testiny usin9 entropy~distance In econometrics, several diagnostic or misspecification tests are usually carried out in

checking up the data consistency of an econometric model, see Pagan and Hall (1983) for details. In most of these testing problems both null and altemative hypotheses are usually parametric, as in Section 4.1.1, and the tests are developed by using Wald, likelihood ratio or Rag's score test. Here we consider some of the testing problems where at least the alternative is in non-parametric form and indicate the application of //-class entropy/divergence measures in developing the test statistics.

First we consider the test of independence between two random variables yl and

);2, f (Yl ,Y2) = f ( Y l ) f ( Y 2 ) or f = f l f 2 . Ahmad (1980) considered this problem by testing H0: p* = 0 against Hi: p* :~ 0 where p* is the affinity measure defined in (3.18). The statistic t~* based on the kernel estimation of the expression in the numerator and denominator of p*, however, has a degenerate distribution under the null. Recently Ahmad and Cerrito (1989) and Fan and Gencay (1992) have modified the estimator of p* such that it has a limiting normal distribution. Fan and Gencay (1992) have also used the fi* statistic to test for symmetry and normality.

In an interesting paper, Robinson (1991) considers the Kullback-Leibler divergence to test for serial independence in time series, that is, H0:D1 = 0 against H1:D1 ~ 0, where D1 = Dl( f , f l f 2 ) is as given in (3.10). Under a certain weighting scheme on the observations he also establishes the asymptotic normality of D l ( j ~ , f l f 2 ) where f , jTlj" 2 are the kemel density estimators. Fan (1992) proposes a test for indepen-

dence by testing H0: d(2)(f, f l f 2 ) = 0, where d(2)(f, f l f 2 ) is the Euclidean distance in (3.15). She shows the asymptotic normality of d(2)(f, J'l f 2 ) without weight- ing assumptions and claims that her test has good power against certain alternatives.

Page 21: Entropy, divergence and distance measures with econometric ... · Note that distance d(x, y) is also called metric. The definition of metric space follows from the definition of distance

A. Ullah / Journal of Statistical Plannin9 and Inference 49 (1996) 13~162 157

Parzen (1979, 1982, 1985) has used a quantile based Shannon entropy and a Kullback- Leibler divergence measure to study the tests for goodness of fit, equality of two distributions and independence of two random variables.

An alternative to the above test is to consider the tests based on the fl-class entropy divergence measure, that is, H0: D/~ = 0 against HI: D/~ # 0, where D# = D#(f, f l f2 ) can be written from (3.9). One can estimate D#(f, f l f2) by the kernel density es- timator b~(f, f l f 2 ) as before. The distribution of/)/~ is, however, not known. The choice of /7 can perhaps be determined on the basis of the power of the test. For fl = 1, the test statistic is as given by Robinson (1991). Another alternative is to use the above statistics by substituting the Maxent density estimates instead of the kernel density estimates. The asymptotic properties of such alternatives are not yet developed.

Now we consider the problem of testing the goodness of fit, H₀: f(y) = f₀(y, θ) against H₁: f(y) ≠ f₀(y, θ), where f₀(y, θ) is the assumed parametric density, for example normal, gamma, beta or exponential. We can write the test statistic as

$$\Lambda_\beta = \exp[H_\beta(f)]\,/\,\exp[H_\beta(f_0)] \qquad (4.12)$$

where H_β(f₀) is the β-class entropy for f₀ = f₀(y, θ). For β = 1, (4.12) was proposed by Gokhale (1983). An alternative statistic is Λ*_β = H_β(f₀) − H_β(f), which, for β = 1, was used by Dudewicz and van der Meulen (1981) for testing the uniform distribution and by Vasicek (1976) for the normal distribution. In practice we replace H_β(f₀) by Ĥ_β(f̂₀), which is, from Section 2.2.2,

$$\hat H_\beta(\hat f_0) = (\beta - 1)^{-1}\left[\,1 - (2\pi)^{-(\beta-1)/2}\,\hat\sigma^{-(\beta-1)}\,\beta^{-1/2}\right]$$

for the normal case and

$$\hat H_\beta(\hat f_0) = (\beta - 1)^{-1}\left[\,1 - \hat\theta^{-(\beta-1)}\,\beta^{-1}\right]$$

for the exponential f₀(y, θ) = θ⁻¹e^{−y/θ}, y > 0, θ > 0; here θ̂ = ȳ and σ̂² = Σ(yᵢ − ȳ)²/(n − 1). Further, Ĥ_β(f) can be replaced by

$$\hat H_\beta(\hat f) = (\beta - 1)^{-1}\left[\,1 - \frac{1}{n}\sum_{i=1}^{n}\hat f^{\,\beta-1}(y_i)\right],$$

where f̂ is a Maxent estimator. An alternative is to consider the kernel density estimator of f. An estimator of H_β was also proposed by Vasicek (1976) for β = 1. His estimator follows by first changing the variable to u = F(x), so that

$$H(f) = -\int f \log f \,\mathrm{d}y = \int_0^1 \log q(u)\,\mathrm{d}u, \qquad (4.13)$$

where q(u) is the quantile density function, and then estimating it by

$$\hat H(f) = \frac{1}{n}\sum_{i=1}^{n} \log\!\left(\frac{n}{2m}\,(x_{i+m} - x_{i-m})\right) \qquad (4.14)$$


where, given the observations yᵢ, x₁ ≤ x₂ ≤ ⋯ ≤ xₙ denote the order statistics, with xᵢ = x₁ for i < 1 and xᵢ = xₙ for i > n, and m < n/2 is a positive integer. Theil's entropy corresponds to m = 1. Mack (1988) has done extensive simulations to study the distribution and power of the statistic Λ₁ based on the kernel, Vasicek and entropy estimates of H(f); also see Parzen (1985). Mack's findings indicate that generally the Vasicek- and entropy-based versions of Λ₁ have better power than the kernel-based version. The distribution and power of the test statistics Λ_β are not known for β ≠ 1.
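For reference, a minimal sketch of Vasicek's spacings estimator (4.14) and of the statistic Λ₁ for the normal null is given below; the window m and the plug-in normal entropy ½ log(2πeσ̂²) (the β = 1 limit of the expression above) are the only ingredients, and the value of m used here is purely illustrative.

```python
import numpy as np

def vasicek_entropy(y, m):
    """Vasicek's (1976) spacings estimator (4.14) of the Shannon entropy H(f)."""
    x = np.sort(y)
    n = len(x)
    idx = np.arange(n)
    upper = x[np.minimum(idx + m, n - 1)]    # x_{i+m}, with x_i = x_n for i > n
    lower = x[np.maximum(idx - m, 0)]        # x_{i-m}, with x_i = x_1 for i < 1
    return np.mean(np.log(n / (2.0 * m) * (upper - lower)))

def lambda1_normal(y, m=3):
    """Goodness-of-fit statistic Lambda_1 = exp[H_hat(f)] / exp[H_hat(f0)] for the
    normal null, with H_hat(f0) = 0.5*log(2*pi*e*sigma_hat^2) (the beta = 1 case)."""
    h_f = vasicek_entropy(y, m)
    h_f0 = 0.5 * np.log(2.0 * np.pi * np.e * y.var(ddof=1))
    return np.exp(h_f - h_f0)

rng = np.random.default_rng(1)
print(lambda1_normal(rng.normal(size=500)))   # roughly 1 under normality
```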

Another test statistic can be developed by using the affinity measure

$$\rho = \int f^{1/2} f_0^{1/2}\,\mathrm{d}y \qquad (4.15)$$

in (3.15). Note that under H₀, ρ = 1. Thus testing for H₀: f(y) = f₀ implies testing for

$$H_0: \rho = 1, \qquad H_1: \rho < 1. \qquad (4.16)$$

For this we consider the test statistic t = (ρ̂ − 1) standardized by its estimated standard error, where ρ̂ = ∫ f̂₀^{1/2} f̂₁^{1/2} dy and f̂₀ and f̂₁ are the parametric and entropy-based estimators described above. Matusita (1955) and Khan and Yaqub (1980) have studied the asymptotic properties of ρ̂ for discrete and continuous densities, respectively.
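As a numerical illustration of ρ̂ (with the entropy-based estimator f̂₁ replaced by a kernel estimator purely for simplicity), the integral can be evaluated on a grid:

```python
import numpy as np

def affinity_hat(y, grid_size=512):
    """Numerical evaluation of rho_hat = integral of (f0_hat * f1_hat)^{1/2}, where
    f0_hat is the fitted normal density and f1_hat a Gaussian kernel estimate
    (used here in place of the entropy-based estimator, for simplicity)."""
    n = len(y)
    mu, sigma = y.mean(), y.std(ddof=1)
    h = 1.06 * sigma * n ** (-0.2)
    grid = np.linspace(y.min() - 3 * h, y.max() + 3 * h, grid_size)
    f0 = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    u = (grid[:, None] - y[None, :]) / h
    f1 = np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    return np.trapz(np.sqrt(f0 * f1), grid)     # equals 1 when f = f0 exactly

rng = np.random.default_rng(2)
print(affinity_hat(rng.normal(size=400)))       # near 1 under the normal null
```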

In addition to the above tests, Fan (1992), Fan and Gencay (1992) and Parzen (1985) provide alternative goodness-of-fit tests. A comparative study of the properties of all these tests will be a useful subject of future research.

5. Other econometric applications and conclusion

In Section 4 we looked into the applications of entropy and divergence measures in the contexts of estimation and hypothesis testing. Here we summarize the results of some other applications.

Theil (1967) and Maasoumi (1986), among others, have made useful applications of the entropy and Kullback-Leibler divergence for studying income inequality and welfare economics.

In the context of econometric estimation, an early application appeared in Tintner (1960), where he develops weighted regression estimation by the Maxent principle. More recently, in an interesting paper Soofi (1990) considers the linear regression model y = Zγ + u, where u ~ N(0, σ²I) and Z is an n × k nonstochastic matrix, and provides a multicollinearity index. This is developed by calculating the divergence H(f₁, f₂), where f₁ is the distribution of the least squares estimator γ̂ and f₂ is the distribution of γ̂ when Z′Z = I. The multicollinearity index is

$$2H(f_1, f_2) = \sum_j \lambda_j^{-1} + \sum_j \log \lambda_j - k,$$

where λⱼ are the eigenvalues of Z′Z.
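For illustration, the index as reconstructed above is immediate from the eigenvalues of Z′Z; the design matrices below are simulated and purely illustrative.

```python
import numpy as np

def collinearity_index(Z):
    """Multicollinearity index 2*H(f1, f2) = sum(1/lambda_j) + sum(log lambda_j) - k,
    computed from the eigenvalues lambda_j of Z'Z (expression as reconstructed above)."""
    lam = np.linalg.eigvalsh(Z.T @ Z)
    return np.sum(1.0 / lam) + np.sum(np.log(lam)) - Z.shape[1]

rng = np.random.default_rng(3)
z1 = rng.normal(size=100)
Z_collinear = np.column_stack([z1, z1 + 0.01 * rng.normal(size=100)])
print(collinearity_index(Z_collinear))          # large: near-collinear columns
Q, _ = np.linalg.qr(rng.normal(size=(100, 2)))
print(collinearity_index(Q))                    # ~0: Q'Q = I
```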


Parzen (1982, 1990) has considered the Maxent principles in time series. Galbraith and Zinde-Walsh (1992) utilize the Hilbert distance to estimate moving average parameters. Piccolo (1990) employs a distance measure in cluster analysis for time series and seasonal adjustment filters.

Akaike (1973), Sawa (1978), Klein and Brown (1989), Sin and White (1992) and Rissanen (1987a, b) explore divergence measures for model selection; also see Zellner and Min (1992), where the Bayesian method is considered. Fan (1992), Fan and Gencay (1992), Parzen (1979), Robinson (1991), Ullah and Singh (1990) and White (1992), among others, have explored the uses of distance measures for hypothesis testing.

Recently Golan and Judge (1992), Judge and Golan (1992), Golan (1992), O'Sullivan (1986) and Grandy (1985) have explored the use of the Maxent principle for recovering the vectors γ and u in the ill-posed inverse problem y = Zγ + u, where Z is an n × k non-invertible matrix of known values. The ill-posed inverse problem is the problem of recovering γ when Z and y do not carry enough information. This can happen when Z is not of full column rank or k > n. In these situations the standard least squares estimators of γ are well known to have poor statistical performance. An alternative is to use the quadratic optimization method, which obtains γ by minimizing (y − Zγ)′(y − Zγ) + λγ′Cγ, where λ is the smoothing parameter. The solution is the well-known ridge estimator γ̂ = (Z′Z + λC)⁻¹Z′y. Recently Silver (1991), among others, considered the entropy penalty γ′ log γ = Σᵢ γᵢ log γᵢ in place of the quadratic penalty γ′Cγ. However, in this Shannon entropy setup γ must be ⩾ 0 and its elements must add to 1. Further, the solution for γ is nonlinear and does not have a closed form. For an excellent treatment of these methods, see Judge and Golan (1992), where they also suggest the Maxent method, which is free from the probability constraints on γ. They have also shown that their methods work well for well-posed models. These results are very promising and should prove very useful in statistical estimation and econometric model specification.
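A minimal sketch of the quadratic (ridge) solution described above, with C = I and an arbitrary illustrative λ (the entropy-penalty and Maxent alternatives have no comparable closed form):

```python
import numpy as np

def ridge(y, Z, lam, C=None):
    """Closed-form solution gamma_hat = (Z'Z + lam*C)^{-1} Z'y of the penalized
    problem min_gamma (y - Z gamma)'(y - Z gamma) + lam * gamma' C gamma."""
    k = Z.shape[1]
    C = np.eye(k) if C is None else C
    return np.linalg.solve(Z.T @ Z + lam * C, Z.T @ y)

# Ill-posed case: k > n, so ordinary least squares is not defined,
# but the penalized (ridge) solution still is.
rng = np.random.default_rng(4)
n, k = 20, 50
Z = rng.normal(size=(n, k))
gamma_true = np.zeros(k)
gamma_true[:3] = 1.0
y = Z @ gamma_true + 0.1 * rng.normal(size=n)
print(ridge(y, Z, lam=1.0)[:5])
```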

Johnson and Omberg (1988) have explored the entropy approach to macroeconomics. More recently, Sengupta (1991, 1992) has applied the conditional Shannon entropy of Section 2.2.1 to the study of market volatility and portfolio models.

For the applications of entropy in other sciences such as regional and urban planning, non-linear spectral analysis, queuing theory, pattern recognition and statistics we refer to Kapur and Kesavan (1992).

We have described above some of the applications of entropy/divergence measures in econometrics and economics. Earlier we also presented general classes of entropy and divergence measures and their applications in econometric hypothesis testing and estimation. The sampling properties of these classes of entropy measures are not yet well established in the context of econometric inference. Also, the computational aspects of implementing the Maxent principles in multivariate cases need to be developed. In addition, we observe that most of the applied studies described above are based on the Shannon and Kullback-Leibler entropies. The robustness of these studies against other classes of entropy/divergence measures needs to be further investigated.


References

Agmon, N., Y. Alhassid and R.D. Levine (1979). An algorithm for finding the distribution of maximal entropy. J. Comput. Phys. 30, 250-259.
Ahmad, I.A. (1980). Nonparametric estimation of an affinity measure between two absolutely continuous distributions with hypothesis testing applications. Ann. Inst. Statist. Math. 32, Part A, 233-240.
Ahmad, I. and P.B. Cerrito (1989). Hypothesis testing using density estimation: one and two sample goodness-of-fit tests. Manuscript, University of Northern Illinois.
Ahmed, N.A. and D.V. Gokhale (1989). Entropy expressions and their estimators for multivariate distributions. IEEE Trans. Inform. Theory 35, 688-692.
Atkinson, C. and A.F.S. Mitchell (1981). Rao's distance measure. Sankhya 43, 345-365.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: B.N. Petrov and F. Csaki, Eds., Proc. 2nd Internat. Symp. on Information Theory, Akademiai Kiado, Budapest, 267-281.
Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35, 99-109.
Boltzman, L. (1872). Weitere Studien über das Wärmegleichgewicht unter Gasmolekülen. K. Akad. Wiss. (Wien) Sitzb. 66, 275.
Buck, B. and V.A. Macaulay (1991). Maximum Entropy in Action. Clarendon Press, Oxford.
Burbea, J. and C.R. Rao (1982). Entropy differential metric, distance and divergence measures in probability spaces: a unified approach. J. Multivariate Anal. 12, 576-579.
Cover, T.M. and J.A. Thomas (1991). Elements of Information Theory. Wiley, New York.
Clausius, R. (1864). Abhandlungen über die mechanische Wärmetheorie. Friedrich Vieweg, Braunschweig.
Csiszár, I. (1972). A class of measures of informativity of observation channels. Period. Math. Hungar. 2, 191-213.
Csiszár, I. (1991). Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Statist. 19, 2032-2066.
Davis, H.T. (1941). The Theory of Econometrics. The Principia Press, Bloomington, IN.
Dudewicz, E.J. and E.C. van der Meulen (1981). Entropy-based tests of uniformity. J. Amer. Statist. Assoc. 76, 967-974.
Fan, Y. (1992). Testing the goodness-of-fit of a parametric density function by kernel method. Manuscript, University of Windsor.
Fan, Y. and R. Gencay (1992). Hypothesis testing based on modified nonparametric estimation of an affinity measure between two distributions. Department of Economics, University of Windsor. J. Nonparametric Statist. (to appear).
Galbraith, J. and V. Zinde-Walsh (1992). Simple estimators for MA models based on approximations evaluated using Hilbert distances. Working paper, Department of Economics, McGill University.
Gallant, A.R. (1981). On the bias of flexible functional forms and an essentially unbiased form: the Fourier flexible form. J. Econometrics 15, 211-245.
Gokhale, D.V. (1983). On entropy-based goodness-of-fit tests. Comput. Statist. Data Anal. 1, 157-165.
Golan, A. (1992). A multivariable stochastic theory of size distribution of firms with empirical evidence. Adv. Econometrics 10.
Golan, A. and G. Judge (1992). Recovering information in the case of undetermined problems and incomplete data. Manuscript, University of California, Berkeley.
Grandy, W.T. (1985). Incomplete information and generalized inverse problems. In: C.R. Smith and W.T. Grandy, Eds., Maximum Entropy and Bayesian Methods in Inverse Problems, D. Reidel, Boston.
Hardle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, New York.
Hartley, R.V.L. (1928). Transmission of information. Bell System Tech. J. 7, 535.
Havrda, J. and F. Charvat (1967). Quantification method in classification processes: concept of structural α-entropy. Kybernetika 3, 30-35.
Jaynes, E.T. (1957). Information theory and statistical mechanics. Phys. Rev. 106, 620-630.
Jaynes, E. (1979). Concentration of distributions. In: R. Rosenkrantz, Ed., E. Jaynes: Papers on Probability, Statistics and Statistical Physics, Reidel, Dordrecht.
Johnson, H. and E. Omberg (1988). The information theory approach to macroeconomics. Manuscript, University of California, Riverside.


Judge, G. and A. Golan (1992). Recovering information in the case of ill-posed inverse problems with noise. Manuscript, University of California, Berkeley.
Kannappan, P. (1974). On a generalization of some measures in information theory. Glas. Mat., 81-92.
Kapur, J.N. (1989). Maximum Entropy Models in Science and Engineering. Wiley Eastern, New Delhi.
Kapur, J.N. and H.K. Kesavan (1992). Entropy Optimization Principles with Applications. Academic Press, San Diego.
Khan, A.H. and M. Yaqub (1980). Distribution of a distance function. Ann. Inst. Statist. Math. 32, 247-253.
Khinchin, A.I. (1957). Mathematical Foundation of Information Theory. Dover, New York.
Kirmani, S.N.U.A. (1971). Some limiting properties of Matusita's measure of distance. Ann. Inst. Statist. Math., 157-162.
Klein, R.W. and S.J. Brown (1989). Model selection under "minimal" prior information. Bell Labs., Murray Hill.
Kullback, S. and R.A. Leibler (1951). On information and sufficiency. Ann. Math. Statist. 22, 79-86.
Kupperman, H. (1958). Probabilities of hypotheses and information statistics in sampling from exponential class of population. Ann. Math. Statist. 29, 571-574.
Maasoumi, E. (1986). The measurement and decomposition of multidimensional inequality. Econometrica 54, 991-997.
Mack, S.P. (1988). A comparative study of entropy estimators and entropy based goodness-of-fit tests. Ph.D. thesis, University of California, Riverside.
Mahalanobis, P.C. (1936). On the generalized distance in statistics. Proc. Nat. Inst. Sci. 2, 49-55.
Matusita, K. (1955). Decision rules based on distance for problems of fit, two samples and estimation. Ann. Math. Statist. 26, 631-641.
Matusita, K. (1967). On the notion of affinity of several distributions and some of its applications. Ann. Inst. Statist. Math. 19, 181-192.
O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems. Statist. Sci. 1, 502-527.
Pagan, A.R. and A.D. Hall (1983). Diagnostic tests as residual analysis. Econometric Rev. 2, 159-218.
Pagan, A.R. and A. Ullah (1990). Semiparametric estimation of regression parametrics. Manuscript, University of California, Riverside.
Parzen, E. (1979). Nonparametric statistical data modelling. J. Amer. Statist. Assoc. (with discussion) 74, 105-131.
Parzen, E. (1982). Maximum entropy interpretation of autoregressive spectral densities. Statist. Probab. Letters 1, 2-6.
Parzen, E. (1985). Entropy interpretation of goodness-of-fit tests. Manuscript, Texas A&M University.
Parzen, E. (1990). Time series, statistics, and information. IMA preprint series #663.
Piccolo, D. (1990). A distance measure for classifying ARIMA models. J. Time Ser. Anal. 11, 153-164.
Rao, C.R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81-91.
Rao, C.R. (1982). Diversity and dissimilarity coefficients. Theor. Pop. Bio. 21, 24-43.
Rao, C.R. (1985). Differential metrics in probability spaces based on entropy and divergence measures. Technical report 85-08, University of Pittsburgh.
Renyi, A. (1961). On measures of entropy and information. In: Proc. 4th Berkeley Symp. on Math. Statist. and Probability, Vol. 1, University of California Press, Berkeley, 547-561.
Rissanen, J. (1987a). Stochastic complexity. J. Roy. Statist. Soc. Ser. B 49, 223-265.
Rissanen, J. (1987b). Stochastic complexity and the MDL principle. Econometric Rev. 6, 85-102.
Robinson, P.M. (1988). Semiparametric econometrics: A survey. J. Appl. Econometrics.
Robinson, P.M. (1991). Consistent nonparametric entropy-based testing. Rev. Econom. Stud. 58, 437-453.
Ryu, H.K. (1991). Orthonormal basis and maximum entropy estimation of probability density and regression functions. Ph.D. thesis, University of Chicago.
Sawa, T. (1978). Information criteria for discriminating among alternative regression models. Econometrica 46(6), 1273-1291.
Sengupta, J.K. (1991). Maximum probability dominance and portfolio theory. J. Optim. Theory Appl. 71, 341-351.
Sengupta, J.K. (1992). Market volatility and portfolio efficiency. Manuscript, University of California, Santa Barbara.
Shannon, C.E. (1948). The mathematical theory of communication. Bell System Tech. J.; reprinted in Shannon and Weaver (1949), The Mathematical Theory of Communication, University of Illinois, Urbana, 3-91.


Silver, R.N. (1991). Classical statistical inference and maximum entropy. Los Alamos National Laboratory.
Sin, C. and H. White (1992). Information criteria for selecting possibly misspecified parametric models. DP 92-42, University of California, San Diego.
Singh, R.S. and D.S. Tracy (1977). Strongly consistent estimators of the order regression curves and rates of convergence. Z. Wahrsch. Verw. Gebiete 40, 339-348.
Soofi, E.S. (1990). Effects of collinearity on information about regression coefficients. J. Econometrics, 255-274.
Soofi, E.S. and D.V. Gokhale (1991). An information criterion for normal regression estimation. Statist. Probab. Lett., 111-117.
Theil, H. (1967). Economics and Information Theory. Rand McNally, Chicago, IL.
Tintner, G. (1960). Applications of the theory of information to the problem of weighted regression. Onore de Corrado Gini, 1, 29, Rome Inst. de Statist. University.
Ullah, A. (1983). Divergence, distance and entropy measures: unification and applications. Manuscript, University of Western Ontario.
Ullah, A. (1988). Nonparametric estimation of econometric functions. Canad. J. Econom. 21, 625-658.
Ullah, A. and R.S. Singh (1990). Estimation of a probability density function with applications to nonparametric inference in econometrics. In: B. Raj, Ed., Advances in Econometrics and Modelling, Kluwer.
Vasicek, O. (1976). A test for normality based on sample entropy. J. Roy. Statist. Soc. Ser. B 38, 54-59.
Wiener, N. (1949). Cybernetics. Wiley, New York.
White, H. (1992). Estimation, Inference and Specification Analysis. Cambridge University Press, forthcoming.
Zellner, A. (1977). Maximal data information prior distributions. In: A. Aykac and C. Brumat, Eds., New Developments in the Applications of Bayesian Methods, North-Holland, Amsterdam, 211-232.
Zellner, A. (1991). Bayesian methods and entropy in economics and econometrics. In: W.T. Grandy, Jr. and L.H. Schick, Eds., Maximum Entropy and Bayesian Methods, Kluwer, Amsterdam, 17-31.
Zellner, A. and R. Highfield (1988). Calculation of maximum entropy distributions and approximation of marginal distributions. J. Econometrics 37, 195-209.
Zellner, A. and C. Min (1992). Bayesian analysis, model selection and prediction. Manuscript, University of Chicago.
Zinde-Walsh, V. (1990). The consequences of misspecification in time series processes. Econom. Lett. 32, 237-241.
Zinde-Walsh, V. (1992). Distance measures in the spaces of stochastic processes and relations between them. Manuscript, McGill University.
Zografos, K., K. Ferentinos and T. Papaioannou (1989). Limiting properties of some measures of information. Ann. Inst. Statist. Math. 41, 451-460.