
On Generalized Measures of Information with

Maximum and Minimum Entropy Prescriptions

A Thesis

Submitted For the Degree of

Doctor of Philosophy

in the Faculty of Engineering

by

Ambedkar Dukkipati

Computer Science and Automation

Indian Institute of Science

Bangalore – 560 012

March 2006


Abstract

Kullback-Leibler relative-entropy or KL-entropy of P with respect to R, defined as $\int_X \ln \frac{dP}{dR} \, dP$,

where P and R are probability measures on a measurable space (X,M), plays a basic role in the

definitions of classical information measures. It overcomes a shortcoming of Shannon entropy, whose discrete-case definition cannot be extended naturally to the nondiscrete case. Further,

entropy and other classical information measures can be expressed in terms of KL-entropy and

hence properties of their measure-theoretic analogs will follow from those of measure-theoretic

KL-entropy. An important theorem in this respect is the Gelfand-Yaglom-Perez (GYP) Theorem

which equips KL-entropy with a fundamental definition and can be stated as: measure-theoretic

KL-entropy equals the supremum of KL-entropies over all measurable partitions of X . In this

thesis we provide the measure-theoretic formulations for ‘generalized’ information measures, and

state and prove the corresponding GYP-theorem – the ‘generalizations’ being in the sense of Renyi

and nonextensive, both of which are explained below.

Kolmogorov-Nagumo average or quasilinear mean of a vector x = (x1, . . . , xn) with respect to a pmf p = (p1, . . . , pn) is defined as $\langle x \rangle_\psi = \psi^{-1}\!\left( \sum_{k=1}^{n} p_k \psi(x_k) \right)$, where ψ is an arbitrary

continuous and strictly monotone function. Replacing linear averaging in Shannon entropy with

Kolmogorov-Nagumo averages (KN-averages) and further imposing the additivity constraint – a

characteristic property of the underlying information associated with a single event, which is logarith-

mic – leads to the definition of α-entropy or Renyi entropy. This is the first formal well-known

generalization of Shannon entropy. Using this recipe of Renyi’s generalization, one can prepare

only two information measures: Shannon and Renyi entropy. Indeed, using this formalism Renyi

characterized these additive entropies in terms of axioms of KN-averages. On the other hand, if

one generalizes the information of a single event in the definition of Shannon entropy, by replacing the logarithm with the so-called q-logarithm, defined as $\ln_q x = \frac{x^{1-q}-1}{1-q}$, one gets

what is known as Tsallis entropy. Tsallis entropy is also a generalization of Shannon entropy

but it does not satisfy the additivity property. Instead, it satisfies pseudo-additivity of the form

$x \oplus_q y = x + y + (1-q)xy$, and hence it is also known as nonextensive entropy. One can apply

Renyi’s recipe in the nonextensive case by replacing the linear averaging in Tsallis entropy with

KN-averages and thereby imposing the constraint of pseudo-additivity. A natural question that

arises is what are the various pseudo-additive information measures that can be prepared with this

recipe? We prove that Tsallis entropy is the only one. Here, we mention that one of the impor-

tant characteristics of this generalized entropy is that while canonical distributions resulting from

‘maximization’ of Shannon entropy are exponential in nature, in the Tsallis case they result in

power-law distributions.


The concept of maximum entropy (ME), originally from physics, has been promoted to a gen-

eral principle of inference primarily by the works of Jaynes and (later on) Kullback. This connects

information theory and statistical mechanics via the principle: the states of thermodynamic equi-

librium are states of maximum entropy, and further connects to statistical inference via the prescription: select the probability distribution that maximizes the entropy. The two fundamental principles related to

the concept of maximum entropy are Jaynes' maximum entropy principle, which involves maximizing Shannon entropy, and Kullback's minimum entropy principle, which involves minimizing

relative-entropy, with respect to appropriate moment constraints.

Though relative-entropy is not a metric, in cases involving distributions resulting from relative-

entropy minimization, one can bring forth certain geometrical formulations. These are reminiscent

of squared Euclidean distance and satisfy an analogue of Pythagoras' theorem. This property

is referred to as Pythagoras’ theorem of relative-entropy minimization or triangle equality and

plays a fundamental role in geometrical approaches to statistical estimation theory like informa-

tion geometry. In this thesis we state and prove the equivalent of Pythagoras’ theorem in the

nonextensive formalism. For this purpose we study relative-entropy minimization in detail and

present some results.

Finally, we demonstrate the use of power-law distributions, resulting from ME-prescriptions

of Tsallis entropy, in evolutionary algorithms. This work is motivated by the recently proposed

generalized simulated annealing algorithm based on Tsallis statistics.

To sum up, in light of their well-known axiomatic and operational justifications, this thesis

establishes some results pertaining to the mathematical significance of generalized measures of

information. We believe that these results represent an important contribution towards the ongoing

research on understanding the phenomena of information.


To

Bhirava Swamy and Bharati who infected me with a disease called Life

and to

all my Mathematics teachers who taught me how to extract sweetness from it.

-------------

. . . lie down in a garden and extract from the disease,

especially if it’s not a real one, as much sweetness as

possible. There’s a lot of sweetness in it.

FRANZ KAFKA IN A LETTER TO MILENA


Acknowledgements

No one deserves more thanks for the success of this work than my advisers Prof. M. Narasimha

Murty and Dr. Shalabh Bhatnagar. I wholeheartedly thank them for their guidance.

I thank Prof. Narasimha Murty for his continued support throughout my graduate student

years. I always looked upon him for advice – academic or non-academic. He has always been

a very patient critique of my research approach and results; without his trust and guidance this

thesis would not have been possible. I feel that I am more disciplined, simple and punctual after

working under his guidance.

The opportunity to watch Dr. Shalabh Bhatnagar in action (particularly during discussions)

has fashioned my way of thought in problem solving. He has been a valuable adviser, and I hope

my three and a half years of working with him have left me with at least a few of his qualities.

I am thankful to the Chairman, Department of CSA for all the support.

I am privileged to learn mathematics from the great teachers: Prof. Vittal Rao, Prof. Adi Murty

and Prof. A. V. Gopala Krishna. I thank them for imbibing in me the rigour of mathematics.

Special thanks are due to Prof. M. A. L. Thathachar for having taught me.

I thank Dr. Christophe Vignat for his criticisms and encouraging advice on my papers.

I wish to thank CSA staff Ms. Lalitha, Ms. Meenakshi and Mr. George for being of very great

help in administrative works. I am thankful to all my labmates: Dr. Vishwanath, Asharaf, Shahid,

Rahul, Dr. Vijaya, for their help. I also thank my institute friends Arjun, Raghav, Ranjna.

I will never forget the time I spent with Asit, Aneesh, Gunti, Ravi. Special thanks to my music

companions, Raghav, Hari, Kripa, Manas, Niki. Thanks to all IISc Hockey club members and my

running mates, Sai, Aneesh, Sunder. I thank Dr. Sai Jagan Mohan for correcting my drafts.

Special thanks are due to Vinita who corrected many of my drafts of papers, this thesis, all the

way from DC and WI. Thanks to Vinita, Moski and Madhulatha for their care.

I am forever indebted to my sister Kalyani for her prayers. My special thanks are due to my

sister Sasi and her husband and to my brother Karunakar and his wife. Thanks to my cousin

Chinni for her special care. The three great new women in my life: my nieces Sanjana (3 years),

Naomika (2 years), Bhavana (3 months) who will always be dear to me. I reserve my special love

for my nephew (new born).

I am indebted to my father for keeping his promise that he will continue to guide me even

though he had to go to unreachable places. I owe everything to my mother for taking care of every

need of mine. I dedicate this thesis to my parents and to my teachers.


Contents

Abstract i

Acknowledgements iv

Notations viii

1 Prolegomenon 1

1.1 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2 Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.2.1 What is Entropy? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.2.2 Why to maximize entropy? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.3 A reader’s guide to the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 KN-averages and Entropies: Renyi's Recipe 19

2.1 Classical Information Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.1.1 Shannon Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.1.2 Kullback-Leibler Relative-Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2 Renyi’s Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.1 Hartley Function and Shannon Entropy . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.2 Kolmogorov-Nagumo Averages or Quasilinear Means . . . . . . . . . . . . . . . 27

2.2.3 Renyi Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.3 Nonextensive Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3.1 Tsallis Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3.2 q-Deformed Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.4 Uniqueness of Tsallis Entropy under Renyi’s Recipe . . . . . . . . . . . . . . . . . . . . 38

2.5 A Characterization Theorem for Nonextensive Entropies . . . . . . . . . . . . . . . . . . 43

3 Measures and Entropies: Gelfand-Yaglom-Perez Theorem 46

3.1 Measure Theoretic Definitions of Classical Information Measures . . . . . . . . . . . . . 48

3.1.1 Discrete to Continuous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.1.2 Classical Information Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.1.3 Interpretation of Discrete and Continuous Entropies in terms of KL-entropy . . . 54


3.2 Measure-Theoretic Definitions of Generalized Information Measures . . . . . . . . . . . 56

3.3 Maximum Entropy and Canonical Distributions . . . . . . . . . . . . . . . . . . . . . . . 58

3.4 ME-prescription for Tsallis Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.4.1 Tsallis Maximum Entropy Distribution . . . . . . . . . . . . . . . . . . . . . . . 60

3.4.2 The Case of Normalized q-expectation values . . . . . . . . . . . . . . . . . . . 62

3.5 Measure-Theoretic Definitions Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.5.1 On Measure-Theoretic Definitions of Generalized Relative-Entropies . . . . . . . 64

3.5.2 On ME of Measure-Theoretic Definition of Tsallis Entropy . . . . . . . . . . . . 69

3.6 Gelfand-Yaglom-Perez Theorem in the General Case . . . . . . . . . . . . . . . . . . . . 70

4 Geometry and Entropies: Pythagoras' Theorem 75

4.1 Relative-Entropy Minimization in the Classical Case . . . . . . . . . . . . . . . . . . . . 77

4.1.1 Canonical Minimum Entropy Distribution . . . . . . . . . . . . . . . . . . . . . 78

4.1.2 Pythagoras’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.2 Tsallis Relative-Entropy Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.2.1 Generalized Minimum Relative-Entropy Distribution . . . . . . . . . . . . . . . 81

4.2.2 q-Product Representation for Tsallis Minimum Entropy Distribution . . . . . . . 82

4.2.3 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.2.4 The Case of Normalized q-Expectations . . . . . . . . . . . . . . . . . . . . . . 86

4.3 Nonextensive Pythagoras’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.3.1 Pythagoras’ Theorem Restated . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4.3.2 The Case of q-Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.3.3 In the Case of Normalized q-Expectations . . . . . . . . . . . . . . . . . . . . . 92

5 Power-laws and Entropies: Generalization of Boltzmann Selection 95

5.1 EAs based on Boltzmann Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.2 EA based on Power-law Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6 Conclusions 106

6.1 Contributions of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6.3 Concluding Thought . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110


Bibliography 111


Notations

R The set (field) of real numbers

R+ [0,∞)

Z+ The set of positive integers

2X Power set of the set X

#E Cardinality of a set E

χE : X → {0, 1} Characteristic function of a set E ⊆ X

(X,M) Measurable space, where X is a nonempty set and M is a σ-algebra

a.e Almost everywhere

〈X〉 Expectation of random variable X

EX Expectation of random variable X

〈X〉ψ KN-average: expectation of random variable X with respect to a function ψ

〈X〉q q-expectation of random variable X

〈〈X〉〉q Normalized q-expectation of random variable X

ν ≪ µ Measure ν is absolutely continuous w.r.t. measure µ

S Shannon entropy functional

Sq Tsallis entropy functional

Sα Renyi entropy functional

Z Partition function of maximum entropy distributions

Z Partition function of minimum relative-entropy distribution


1 Prolegomenon

Abstract

This chapter serves as an introduction to the thesis. The purpose is to motivate the discussion on generalized information measures and their maximum entropy prescriptions by introducing, in broad brush-strokes, a picture of information theory and its relation with statistical mechanics and statistics. It also contains a road-map of the thesis, which should serve as a reader's guide.

Given our obsession to quantify – put formally, to find a way of assigning a real number to (that is, to measure) any phenomenon we come across – it is natural to ask the following question: how would one measure 'information'? The question was asked at

the beginning of this age of information sciences and technology itself and a satisfac-

tory answer was given. The theory of information was born . . . a ‘bandwagon’ . . .

as Shannon (1956) himself called it.

“A key feature of Shannon’s information theory is the discovery that the colloquial

term information can often be given a mathematical meaning as a numerically measur-

able quantity, on the basis of a probabilistic model, in such a way that the solution of

many important problems of information storage and transmission can be formulated

in terms of this measure of the amount of information. This information measure has

a very concrete operational interpretation: roughly, it equals the minimum number of

binary digits needed, on the average, to encode the message in question. The coding

theorems of information theory provide such overwhelming evidence for the adequate-

ness of Shannon’s information measure that to look for essentially different measures

of information might appear to make no sense at all. Moreover, it has been shown

by several authors, starting with Shannon (1948), that the measure of the amount of

information is uniquely determined by some rather natural postulates. Still, all the

evidence that Shannon’s information measure is the only possible one, is valid only

within the restricted scope of coding problems considered by Shannon. As Renyi

pointed out in his fundamental paper (Renyi, 1961) on generalized information mea-

sure, in other sorts of problems other quantities may serve just as well or even better

as measures of information. This should be indicated either by their operational sig-

nificance (pragmatic approach) or by a set of natural postulates characterizing them


(axiomatic approach) or, preferably, by both.”

The above passage is quoted from a critical survey on information measures by

Csiszar (1974), which summarizes the significance of information measures and scope

of generalizing them. Now we shall see the details.

Information Measures and Generalizations

The central tenet of Shannon’s information theory is the construction of a measure of

“amount of information” inherent in a probability distribution. This construction is in

the form of a functional that returns a real number, which is interpreted as the amount of information of a probability distribution; hence the functional is known as an information measure. The underlying concept in this construction is that the amount of information is complementary to the amount of uncertainty, and the measure happens to be logarithmic.

The logarithmic form of information measure dates back to Hartley (1928), who

introduced the practical measure of information as the logarithm of the number of pos-

sible symbol sequences, where the events are considered to be equally

probable. It was Shannon (1948), and independently Wiener (1948), who introduced

a measure of information of general finite probability distribution p with point masses

p1, . . . , pn as

\[
  S(p) = -\sum_{k=1}^{n} p_k \ln p_k .
\]

Owing to its similarity as a mathematical expression to Boltzmann entropy in ther-

modynamics, the term 'entropy' was adopted in the information sciences and is used synonymously with information measure. Shannon demonstrated many nice properties of his entropy measure that justify calling it a measure of information. One important prop-

erty of Shannon entropy is the additivity, i.e., for two independent distributions, the

entropy of the joint distribution is the sum of the entropies of the two distributions.
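To make the additivity property concrete, here is a minimal Python sketch (our illustration, not part of the thesis; the function name shannon_entropy is ours) that computes the Shannon entropy of a finite pmf and checks additivity for two independent distributions.

    import math

    def shannon_entropy(p):
        # S(p) = -sum_k p_k ln p_k, with the convention 0 ln 0 = 0
        return -sum(pk * math.log(pk) for pk in p if pk > 0)

    p = [0.5, 0.25, 0.25]
    r = [0.7, 0.3]
    # Joint pmf of two independent distributions: all products p_i * r_j.
    joint = [pi * rj for pi in p for rj in r]

    # Additivity: S(joint) = S(p) + S(r), up to floating-point error.
    assert abs(shannon_entropy(joint) - (shannon_entropy(p) + shannon_entropy(r))) < 1e-12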

Today, information theory is considered to be a very fundamental field which inter-

sects with physics (statistical mechanics), mathematics (probability theory), electrical

engineering (communication theory) and computer science (Kolmogorov complexity)

etc. (cf. Fig. 1.1, pp. 2, Cover & Thomas, 1991).

Now, let us examine an alternate interpretation of the Shannon entropy functional

that is important to study its mathematical properties and its generalizations. Let X

be the underlying random variable, which takes values x1, . . . xn; we use the notation


p(xk) = pk, k = 1, . . . n. Then, Shannon entropy can be written as expectation of a

function of X as follows. Define a function H which assigns each value xk that X

takes, the value − ln p(xk) = − ln pk, for k = 1, . . . n. The quantity − ln pk is known

as the information associated with the single event xk with probability pk, also known

as Hartley information (Aczel & Daroczy, 1975). From this, one can infer that the Shannon entropy expression is an average of Hartley information. This interpretation of Shannon entropy, as an average of the information associated with a single event, is central to Renyi's generalization.

Renyi entropies were introduced into mathematics by Alfred Renyi (1960). The

original motivation was strictly formal. The basic idea behind Renyi’s generalization

is that any putative candidate for an entropy should be a mean, and thereby he uses a well-known idea in mathematics: the linear mean, though most widely used, is not the only possible way of averaging; one can define a mean with respect to an arbitrary function. Here one should be aware that, to define a 'meaningful' gener-

alized mean, one has to restrict the choice of functions to continuous and monotone

functions (Hardy, Littlewood, & Polya, 1934).

Following the above idea, once we replace the linear mean with generalized means,

we have a set of information measures each corresponding to a continuous and mono-

tone function. Can we call every such entity an information measure? Renyi (1960)

postulated that an information measure should satisfy the additivity property, which Shan-

non entropy itself does. The important consequence of this constraint is that it restricts

the choice of function in a generalized mean to linear and exponential functions: if we

choose a linear function, we get back the Shannon entropy; if we choose an exponential function, we obtain the well-known and much studied generalization of Shannon entropy

\[
  S_\alpha(p) = \frac{1}{1-\alpha} \ln \sum_{k=1}^{n} p_k^{\alpha} ,
\]

where α is a parameter corresponding to an exponential function, which specifies the

generalized mean and is known as entropic index. Renyi has called them entropies of

order α (α ≠ 1, α > 0); they include Shannon's entropy in a limiting sense, namely,

in the limit α → 1, α-entropy retrieves Shannon entropy. For this reason, Shannon’s

entropy may be called entropy of order 1.
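As a quick numerical illustration (a sketch of ours, not from the thesis), the Renyi entropy can be evaluated directly from its definition, and the limiting behaviour α → 1 mentioned above can be observed.

    import math

    def shannon_entropy(p):
        return -sum(pk * math.log(pk) for pk in p if pk > 0)

    def renyi_entropy(p, alpha):
        # S_alpha(p) = (1/(1-alpha)) ln sum_k p_k^alpha, for alpha > 0, alpha != 1
        return math.log(sum(pk ** alpha for pk in p)) / (1.0 - alpha)

    p = [0.5, 0.3, 0.2]
    for alpha in (0.5, 2.0, 1.01, 1.001):
        print(alpha, renyi_entropy(p, alpha))
    # The values for alpha close to 1 approach the Shannon entropy:
    print(shannon_entropy(p))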

Renyi studied extensively these generalized entropy functionals in his various pa-

pers; one can refer to his book on probability theory (Renyi, 1970, Chapter 9) for a

summary of results.

While Renyi entropy is considered to be the first formal generalization of Shannon


entropy, Havrda and Charvat (1967) observed that for operational purposes, it seems

more natural to consider the simpler expression $\sum_{k=1}^{n} p_k^{\alpha}$ as an information measure

instead of Renyi entropy (up to a constant factor). Characteristics of this information

measure are studied by Daroczy (1970), Forte and Ng (1973), and it is shown that

this quantity permits simpler postulational characterizations (for the summary of the

discussion see (Csiszar, 1974)).

While generalized information measures, after Renyi’s work, continued to be of

interest to many mathematicians, it was in 1988 that they came to attention in Physics

when Tsallis reinvented the above mentioned Havrda and Charvat entropy (up to a

constant factor), and specified it in the form (Tsallis, 1988)

\[
  S_q(p) = \frac{1 - \sum_{k} p_k^{q}}{q-1} .
\]

Though this expression looks somewhat similar to the Renyi entropy and retrieves

Shannon entropy in the limit q → 1, Tsallis entropy has the remarkable, albeit not

yet understood, property that in the case of independent experiments, it is not addi-

tive. Hence, statistical formalism based on Tsallis entropy is also termed nonextensive

statistics.
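The pseudo-additivity of Tsallis entropy for independent experiments, S_q(A,B) = S_q(A) + S_q(B) + (1-q) S_q(A) S_q(B), is easy to verify numerically; the following sketch (ours, for illustration only) does so.

    def tsallis_entropy(p, q):
        # S_q(p) = (1 - sum_k p_k^q) / (q - 1)
        return (1.0 - sum(pk ** q for pk in p)) / (q - 1.0)

    q = 1.5
    p = [0.6, 0.4]
    r = [0.5, 0.3, 0.2]
    joint = [pi * rj for pi in p for rj in r]   # independent experiments

    lhs = tsallis_entropy(joint, q)
    rhs = (tsallis_entropy(p, q) + tsallis_entropy(r, q)
           + (1.0 - q) * tsallis_entropy(p, q) * tsallis_entropy(r, q))
    assert abs(lhs - rhs) < 1e-12               # pseudo-additivity holds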

Next, we discuss what information measures have to do with statistics.

Information Theory and Statistics

Probabilities are unobservable quantities in the sense that one cannot determine the val-

ues of these corresponding to a random experiment by simply an inspection of whether

the events do, in fact, occur or not. Assessing the probability of the occurrence of some

event or of the truth of some hypothesis is the important question one runs up against

in any application of probability theory to the problems of science or practical life.

Although the mathematical formalism of probability theory serves as a powerful tool

when analyzing such problems, it cannot, by itself, answer this question. Indeed, the

formalism is silent on this issue, since its goal is just to provide theorems valid for

all probability assignments allowed by its axioms. Hence, recourse is necessary to

an additional rule which tells us in which case one ought to assign which values to

probabilities.

In 1957, Jaynes proposed a rule to assign numerical values to probabilities in cir-

cumstances where certain partial information is available. Jaynes showed, in particu-

lar, how this rule, when applied to statistical mechanics, leads to the usual canonical


distributions in an extremely simple fashion. The concept he used was ‘maximum

entropy’.

With his maximum entropy principle, Jaynes re-derived Gibbs-Boltzmann statisti-

cal mechanics a la information theory in his two papers (Jaynes, 1957a, 1957b). This

principle states that the states of thermodynamic equilibrium are states of maximum

entropy. Formally, let p1, . . . , pn be the probabilities that a particle in a system has

energies E1, . . . , En respectively; then the well-known Gibbs-Boltzmann distribution
\[
  p_k = \frac{e^{-\beta E_k}}{Z} , \qquad k = 1, \ldots, n,
\]
can be deduced from maximizing the Shannon entropy functional $-\sum_{k=1}^{n} p_k \ln p_k$ with respect to the constraint of known expected energy $\sum_{k=1}^{n} p_k E_k = U$ along with the normalizing constraint $\sum_{k=1}^{n} p_k = 1$. Here Z is called the partition function and can be specified as
\[
  Z = \sum_{k=1}^{n} e^{-\beta E_k} .
\]
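For concreteness, the Gibbs-Boltzmann distribution above can be computed directly once the energies and β are given; the following Python sketch (our illustration, with hypothetical energy values) does this and reports the normalization and the expected energy.

    import math

    def boltzmann(energies, beta):
        # p_k = exp(-beta * E_k) / Z
        weights = [math.exp(-beta * e) for e in energies]
        Z = sum(weights)                        # partition function
        return [w / Z for w in weights]

    E = [0.0, 1.0, 2.0, 3.0]
    p = boltzmann(E, beta=0.5)
    print(sum(p))                               # 1.0 (normalization)
    print(sum(pk * e for pk, e in zip(p, E)))   # expected energy U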

Though use of maximum entropy has its historical roots in physics (e.g., Elsasser,

1937) and economics (e.g., Davis, 1941), later on, Jaynes showed that a general method

of statistical inference could be built upon this rule, which subsumes the techniques of

statistical mechanics as a mere special case. The principle of maximum entropy states

that, of all the distributions p that satisfy the constraints, one should choose the distri-

bution with largest entropy. In the above formulation of Gibbs-Boltzmann distribution

one can view the mean energy constraint and normalizing constraints as the only avail-

able information. Also, this principle is a natural extension of Laplace’s famous prin-

ciple of insufficient reason, which postulates that the uniform distribution is the most

satisfactory representation of our knowledge when we know nothing about the random

variate except that each probability is nonnegative and the sum of the probabilities is

unity; it is easy to show that Shannon entropy is maximum for uniform distribution.

The maximum entropy principle is used in many fields, ranging from physics (for

example, Bose-Einstein and Fermi-Dirac statistics can be made as though they are

derived from the maximum entropy principle) and chemistry to image reconstruction

and stock market analysis, and recently in machine learning.

While Jaynes was developing his maximum entropy principle for statistical infer-

ence problems, a more general principle was proposed by Kullback (1959, pp. 37)

which is known as the minimum entropy principle. This principle comes into the picture in problems where inductive inference is to update from a prior probability distribution to a posterior distribution whenever new information becomes available. This


principle states that, given a prior distribution r, of all the distributions p that sat-

isfy the constraints, one should choose the distribution with the least Kullback-Leibler

relative-entropy

\[
  I(p\|r) = \sum_{k=1}^{n} p_k \ln \frac{p_k}{r_k} .
\]

Minimizing relative-entropy is equivalent to maximizing entropy when the prior is a

uniform distribution. This principle laid the foundations for an information-theoretic approach to statistics (Kullback, 1959) and plays an important role in certain geometrical approaches to statistical inference (Amari, 1985).
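The statement that relative-entropy minimization with a uniform prior reduces to entropy maximization follows from the identity I(p‖u) = ln n − S(p) for the uniform distribution u on n points; a small numerical check (ours, not from the thesis) is given below.

    import math

    def kl_entropy(p, r):
        # I(p||r) = sum_k p_k ln(p_k / r_k)
        return sum(pk * math.log(pk / rk) for pk, rk in zip(p, r) if pk > 0)

    def shannon_entropy(p):
        return -sum(pk * math.log(pk) for pk in p if pk > 0)

    p = [0.5, 0.2, 0.2, 0.1]
    n = len(p)
    uniform = [1.0 / n] * n
    # I(p||uniform) = ln(n) - S(p): minimizing the left side maximizes S(p).
    assert abs(kl_entropy(p, uniform) - (math.log(n) - shannon_entropy(p))) < 1e-12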

The maximum entropy principle together with the minimum entropy principle is referred to as the ME-principle, and inference procedures based on these principles are collectively known

as ME-methods. Papers by Shore and Johnson (1980) and by Tikochinsky, Tishby,

and Levine (1984) paved the way for strong theoretical justification for using ME-

methods in inference problems. A more general view of ME fundamentals is reported

by Harremoes and Topsøe (2001).

Before we move on we briefly explain the relation between ME and inference

methods using the well-known Bayes’ theorem. The choice between these two updat-

ing methods is dictated by the nature of the information being processed. When we

want to update our beliefs about the value of certain quantities θ on the basis of infor-

mation about the observed values of other quantities x – the data – we must use Bayes'

theorem. If the prior beliefs are given by p(θ), the updated or posterior distribution is

p(θ|x) ∝ p(θ)p(x|θ). Being a consequence of the product rule for probabilities, the

Bayesian method of updating is limited to situations where it makes sense to define

the joint probability of x and θ. The ME-method, on the other hand, is designed for

updating from a prior probability distribution to a posterior distribution when the infor-

mation to be processed is testable information, i.e., it takes the form of constraints on

the family of acceptable posterior distributions. In general, it makes no sense to pro-

cess testable information using Bayes’ theorem, and conversely, neither does it make

sense to process data using ME. However, in those special cases when the same piece

of information can be interpreted both as data and as a constraint, then both methods can

be used and they agree. For more details on ME and Bayes’ approach one can refer to

(Caticha & Preuss, 2004; Grendar jr & Grendar, 2001).

An excellent review of ME-principle and consistency arguments can be found in

the papers by Uffink (1995, 1996) and by Skilling (1984). This subject is dealt with in

applications in the book of Kapur and Kesavan (1997).


Power-law Distributions

Despite the great success of the standard ME-principle, it is a well known fact that

there are many relevant probability distributions in nature which are not easily deriv-

able from the Jaynes-Shannon prescription: power-law distributions constitute an interest-

ing example. If one sticks to the standard logarithmic entropy, ‘awkward constraints’

are needed in order to obtain power-law type distributions (Tsallis et al., 1995). Does

Jaynes' ME-principle suggest in a natural way the possibility of incorporating alternative entropy functionals into the variational principle? It seems that if one replaces

Shannon entropy with its generalization, ME-prescriptions ‘naturally’ result in power-

law distributions.

Power-law distributions can be obtained by optimizing Tsallis entropy under ap-

propriate constraints. The distribution thus obtained is termed the q-exponential distri-

bution. The associated q-exponential function of x is $e_q(x) = [1 + (1-q)x]_{+}^{\frac{1}{1-q}}$, with the notation $[a]_{+} = \max\{0, a\}$, and converges to the ordinary exponential function

in the limit q → 1. Hence the Tsallis formalism offers continuity between the Boltzmann-Gibbs distribution and power-law distributions, governed by the nonextensive parameter q. The Boltzmann-Gibbs distribution is a special case of the power-law distribution of the Tsallis prescription; as we set q → 1, we recover the exponential.
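A small sketch (ours) of the q-exponential function makes both claims visible: as q → 1 it approaches the ordinary exponential, and for q > 1 it has a power-law tail.

    import math

    def q_exp(x, q):
        # e_q(x) = [1 + (1-q)x]_+^(1/(1-q))
        base = 1.0 + (1.0 - q) * x
        return 0.0 if base <= 0.0 else base ** (1.0 / (1.0 - q))

    x = -1.0
    for q in (1.5, 1.1, 1.01, 1.001):
        print(q, q_exp(x, q))                   # approaches exp(-1) ~ 0.3679
    print(math.exp(x))

    # For q > 1 and large k, e_q(-k/kappa) decays like k^(-1/(q-1)), i.e. a power law.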

Here, we take up an important real-world example where the significance of power-law distributions can be demonstrated.

The importance of power-law distributions in the domain of computer science first came to prominence in 1999 in the study of the connectedness of the World Wide Web (WWW).

Using a Web crawler, Barabasi and Albert (1999) mapped the connectedness of the

Web. To their surprise, the web did not have an even distribution of connectivity (so-

called “random connectivity”). Instead, a very few network nodes (called “hubs”)

were far more connected than other nodes. In general, they found that the probability

p(k) that a node in the network connects with k other nodes was, in a given network,

proportional to k−γ , where the degree exponent γ is not universal and depends on

the detail of network structure. Pictorial depiction of random networks and scale-free

networks is given in Figure 1.1.

Here we wish to point out that, using the q-exponential function, p(k) is rewritten

as $p(k) = e_q(-k/\kappa)$, where $q = 1 + \frac{1}{\gamma}$ and $\kappa = (q-1)k_0$. This implies that the Barabasi-

Albert solution optimizes the Tsallis entropy (Abe & Suzuki, 2004).

One more interesting example is the distribution of scientific articles in journals

(Naranan, 1970). If the journals are divided into groups, each containing the same


Figure 1.1: Structure of Random and Scale-Free Networks

number of articles on a given subject, then the number of journals in the succeeding

groups form a geometrical progression.

The Tsallis nonextensive formalism has been applied to analyze the various phenomena

which exhibit power-laws, for example stock markets (Queiros et al., 2005), citations

of scientific papers (Tsallis & de Albuquerque, 2000), scale-free network of earth-

quakes (Abe & Suzuki, 2004), models of network packet traffic (Karmeshu & Sharma,

2006) etc. To a great extent, the success of Tsallis proposal is attributed to the ubiquity

of power law distributions in nature.

Information Measures on Continuum

Until now we have considered information measures in the discrete case, where the

number of configurations is finite. Is it possible to extend the definitions of information

measures to non-discrete cases, or to even more general cases? For example can we

write Shannon entropy in the continuous case, naively, as

\[
  S(p) = -\int p(x) \ln p(x) \, dx
\]

for a probability density p(x)? It turns out that in the above continuous case, entropy

functional poses a formidable problem if one interprets it as an information measure.

Information measures extended to abstract spaces are important not only for mathematical reasons; the resultant generality and rigor could also prove important for even-

tual applications. Even in communication problems discrete memoryless sources and

channels are not always adequate models for real-world signal sources or communica-

tion and storage media. Metric spaces of functions, vectors and sequences as well as

random fields naturally arise as models of source and channel outcomes (Cover, Gacs,

& Gray, 1989). The by-products of general rigorous definitions have the potential for


proving useful new properties, for providing insight into their behavior and for finding

formulas for computing such measures for specific processes.

Immediately after Shannon published his ideas, the problem of extending the defi-

nitions of information measures to abstract spaces was addressed by well-known math-

ematicians of the time, Kolmogorov (1956, 1957) (for an excellent review on Kol-

mogorov’s contributions to information theory see (Cover et al., 1989)), Dobrushin

(1959), Gelfand (1956, 1959), Kullback (Kullback, 1959), Pinsker (1960a, 1960b),

Yaglom (1956, 1959), Perez (1959), Renyi (1960), Kallianpur (1960), etc.

We now examine why extending the Shannon entropy to the non-discrete case is

a nontrivial problem. Firstly, probability densities mostly carry a physical dimension

(say probability per length), which gives the entropy functional the unit of 'ln cm', which seems somewhat odd. Also, in contrast to its discrete-case counterpart, this expression

is not invariant under a reparametrization of the domain, e.g. by a change of unit.

Further, S may now become negative, and is bounded neither from above nor from below, so that new problems of definition appear; cf. (Hardy et al., 1934, pp. 126).

These problems are clarified if one considers how to construct an entropy for a

continuous probability distribution starting from the discrete case. A natural approach

is to consider the limit of the finite discrete entropies corresponding to a sequence of

finite partitions of an interval (on which entropy is defined) whose norms tend to zero.

Unfortunately, this approach does not work, because this limit is infinite for all con-

tinuous probability distributions. Such divergence is also obtained – and explained – if

one adopts the well-known interpretation of the Shannon entropy as the least expected

number of yes/no questions needed to identify the value of x, since in general it takes

an infinite number of such questions to identify a point in the continuum (of course,

this interpretation supposes that the logarithm in entropy functional has base 2).

To overcome the problems posed by the definition of the entropy functional in the continuum, the suggested solution was to consider, in the discrete case, the expression (cf.

Gelfand et al., 1956; Kolmogorov, 1957; Kullback, 1959)

\[
  S(p|\mu) = -\sum_{k=1}^{n} p(x_k) \ln \frac{p(x_k)}{\mu(x_k)} ,
\]

where µ(xk) are positive weights determined by some ‘background measure’ µ. Note

that the above entropy functional S(p|µ) is the negative of Kullback-Leibler relative-

entropy or KL-entropy when we consider that µ(xk) are positive and sum to one.

Now, one can show that the present entropy functional, which is defined in terms of KL-entropy, does have a natural extension to the continuous case (Topsøe, 2001,


Theorem 5.2). This is because, if one now partitions the real line in increasingly finer

subsets, the probabilities corresponding to p and the background weights correspond-

ing to µ are both split simultaneously and the logarithm of their ratio will generally

not diverge.
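The contrast between the two behaviours can be seen numerically. In the sketch below (our illustration, with an arbitrary density on [0, 1]), the plain discrete entropy of an n-bin partition grows like ln n, while the relative form with a uniform background measure converges as the partition is refined.

    import math

    def bin_probs(density, n):
        # midpoint approximation of the bin probabilities on [0, 1]
        return [density((i + 0.5) / n) / n for i in range(n)]

    density = lambda x: 2.0 * x                 # a probability density on [0, 1]
    for n in (10, 100, 1000, 10000):
        p = bin_probs(density, n)
        mu = [1.0 / n] * n                      # uniform background weights
        discrete = -sum(pi * math.log(pi) for pi in p if pi > 0)
        relative = -sum(pi * math.log(pi / mi) for pi, mi in zip(p, mu) if pi > 0)
        print(n, discrete, relative)            # 'discrete' diverges, 'relative' converges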

This is how KL-entropy plays an important role in definitions of information mea-

sures extended to the continuum. Based on the above ideas one can extend the information measures to a measure space (X, M, µ); here µ is exactly the same as that appearing in the above definition of the entropy functional S(p|µ) in the discrete case. The entropy

functionals in both the discrete and continuous cases can be retrieved by appropri-

ately choosing the reference measure µ. Such a definition of information measures

on measure spaces can be used in ME-prescriptions, which are consistent with the prescriptions obtained when their discrete counterparts are used.

One can find the continuum and measure-theoretic aspects of entropy functionals

in the information theory text of Guiasu (1977). A concise and very good discussion

on ME-prescriptions of continuous entropy functionals can be found in (Uffink, 1995).

What is this thesis about?

One can see from the above discussions that the two generalizations of Shannon en-

tropy, Renyi and Tsallis, originated or developed from different fields. Though Renyi’s

generalization originated in information theory, it has been studied in statistical me-

chanics (e.g., Bashkirov, 2004) and statistics (e.g., Morales et al., 2004). Similarly,

Tsallis generalization was mainly studied in statistical mechanics when it was pro-

posed, but, now, Shannon-Khinchin axioms have been extended to Tsallis entropy (Su-

yari, 2004a) and applied to statistical inference problems (e.g., Tsallis, 1998). This

elicits no surprise because from the above discussion one can see that information

theory is naturally connected to statistical mechanics and statistics.

The study of the mathematical properties and applications of generalized infor-

mation measures and, further, new formulations of the maximum entropy principle

based on these generalized information measures constitute a currently growing field

of research. It is in this line of inquiry that this thesis presents some results pertain-

ing to mathematical properties of generalized information measures and their ME-

prescriptions, including the results related to measure-theoretic formulations of the

same.

Finally, note that Renyi and Tsallis generalizations can be ‘naturally’ applied to

Kullback-Leibler relative entropy to define generalized relative-entropy measures, which


are extensively studied in the literature. Indeed, the major results that we present in

this thesis are related to these generalized relative-entropies.

1.1 Summary of Results

Here we give a brief summary of the main results presented in this thesis. Broadly, re-

sults presented in this thesis can be divided into those related to information measures

and their ME-prescriptions.

Generalized Means, Renyi’s Recipe and Information Measures

One can view Renyi’s formalism as a tool, which can be used to generalize informa-

tion measures and thereby characterize them using axioms of Kolmogorov-Nagumo

averages (KN-averages). For example, one can apply Renyi’s recipe in the nonexten-

sive case by replacing the linear averaging in Tsallis entropy with KN-averages and

thereby impose the constraint of pseudo-additivity. A natural question that arises is what

are the various pseudo-additive information measures that one can prepare with this

recipe? In this thesis we prove that only Tsallis entropy is possible in this case, using

which we characterize Tsallis entropy based on axioms of KN-averages.

Generalized Information Measures in Abstract Spaces

Owing to the probabilistic settings for information theory, it is natural that more gen-

eral definitions of information measures can be given on measure spaces. In this thesis

we develop measure-theoretic formulations for generalized information measures and

present some related results.

One can give measure-theoretic definitions for Renyi and Tsallis entropies along

similar lines as Shannon entropy. One can also show that, as is the case with Shannon

entropy, these measure-theoretic definitions are not natural extensions of their discrete

analogs. In this context we present two results: (i) we prove that, as in the case of

classical ‘relative-entropy’, generalized relative-entropies, whether Renyi or Tsallis,

can be extended naturally to the measure-theoretic case, and (ii) we show that ME-

prescriptions of measure-theoretic Tsallis entropy are consistent with the discrete case.

Another important result that we present in this thesis is the Gelfand-Yaglom-Perez

(GYP) theorem for Renyi relative-entropy, which can be easily extended to Tsallis

relative-entropy. GYP-theorem for Kullback-Leibler relative-entropy is a fundamental


theorem which plays an important role in extending discrete case definitions of various

classical information measures to the measure-theoretic case. It also provides a means

to compute relative-entropy and study its behavior.

Tsallis Relative-Entropy Minimization

Unlike the generalized entropy measures, ME of generalized relative-entropies is not

much addressed in the literature. In this thesis we study Tsallis relative-entropy mini-

mization in detail.

We study the properties of Tsallis relative-entropy minimization and present some

differences with the classical case. In the representation of such a minimum relative-

entropy distribution, we highlight the use of the q-product, an operator that has been

recently introduced to derive the mathematical structure behind Tsallis statistics.

Nonextensive Pythagoras’ Theorem

It is a common practice in mathematics to employ geometric ideas in order to obtain

additional insights or new methods even in problems which do not involve geometry

intrinsically. Maximum and minimum entropy methods are no exception.

Kullback-Leibler relative-entropy, in cases involving distributions resulting from

relative-entropy minimization, has a celebrated property reminiscent of squared Eu-

clidean distance: it satisfies an analog of Pythagoras’ theorem. And hence, this prop-

erty is referred to as Pythagoras’ theorem of relative-entropy minimization or triangle

equality, and plays a fundamental role in geometrical approaches to statistical estima-

tion theory like information geometry. We state and prove the equivalent of Pythago-

ras’ theorem in the nonextensive case.

Power-law Distributions in EAs

Recently, power-law distributions have been used in simulated annealing, which claims

to perform better than classical simulated annealing. In this thesis we demonstrate the

use of power-law distributions in evolutionary algorithms (EAs). The proposed algorithm uses the Tsallis generalized canonical distribution, which is a one-parameter gener-

alization of the Boltzmann distribution, to weigh the configurations in the selection

mechanism. We provide some simulation results in this regard.


1.2 Essentials

This section details some heuristic explanations for the logarithmic nature of Hart-

ley and Shannon entropies. We also discuss some notations and why the concept of

“maximum entropy” is important.

1.2.1 What is Entropy?

The logarithmic nature of Hartley and Shannon information measures, and their ad-

ditivity properties can be explained by heuristic arguments. Here we give one such

explanation (Renyi, 1960).

To characterize an element of a set of size n we need log2 n units of information,

where a unit is a bit. The important feature of the logarithmic information measure is

its additivity: If a set E is a disjoint union of m sets E1, . . . , Em of n elements each, then we can specify an element of this mn-element set E in two steps: first we need log2 m bits of

information to describe which of the sets E1, . . . , Em, say Ek, contains the element,

and we need log2 n further bits of information to tell which element of this set Ek is

the considered one. The information needed to characterize an element of E is the

‘sum’ of the two partial informations. Indeed, log2 nm = log2 n+ log2m.

The next step is due to Shannon (1948). He has pointed out that Hartley’s formula

is valid only if the elements of E are equiprobable; if their probabilities are not equal,

the situation changes and we arrive at the formula (2.15). If all the probabilities are

equal to 1/n, Shannon's formula (2.15) reduces to Hartley's formula: S(p) = log2 n.

Shannon’s formula has the following heuristic motivation. Let E be the disjoint

union of the sets E1, . . . , En having N1, . . . , Nn elements respectively ($\sum_{k=1}^{n} N_k = N$). Let us suppose that we are interested only in knowing the subset Ek to which a

given element of E belongs. Suppose that the elements of E are equiprobable. The

information characterizing an element of E consists of two parts: the first specifies the

subset Ek containing this particular element and the second locates it within Ek. The

amount of the second piece of information is log2 Nk (by Hartley's formula), thus it depends on the index k. To specify an element of E we need log2 N bits of information and, as we have seen, it is composed of the information specifying Ek – its amount will be denoted by Hk – and of the information within Ek. According to the principle of additivity, we have $\log_2 N = H_k + \log_2 N_k$, or $H_k = \log_2 \frac{N}{N_k}$. It is plausible to

define the information needed to identify the subset Ek which the considered element


belongs to as the weighted average of the informations Hk, where the weights are the

probabilities that the element belongs to the Ek’s. Thus,

\[
  S = \sum_{k=1}^{n} \frac{N_k}{N} H_k ,
\]
from which we obtain the Shannon entropy expression using the above interpretations of $H_k = \log_2 \frac{N}{N_k}$ and using the notation $p_k = \frac{N_k}{N}$.
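A direct numerical check of this heuristic (our sketch; the group sizes N_k are arbitrary) confirms that the weighted average of the H_k coincides with the Shannon entropy of the probabilities p_k = N_k / N.

    import math

    N_k = [2, 3, 5]                             # sizes of the subsets E_1, ..., E_n
    N = sum(N_k)

    H = [math.log2(N / nk) for nk in N_k]       # H_k = log2(N / N_k)
    S = sum((nk / N) * hk for nk, hk in zip(N_k, H))

    # Shannon entropy (base 2) of p_k = N_k / N
    S_shannon = -sum((nk / N) * math.log2(nk / N) for nk in N_k)
    assert abs(S - S_shannon) < 1e-12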

Now we note one more important idea behind the Shannon entropy. We frequently

come across Shannon entropy being treated as both a measure of uncertainty and of

information. How is this rendered possible?

If X is the underlying random variable, then S(p) is also written as S(X), though it

does not depend on the actual values of X . With this, one can say that S(X) quantifies

how much information we gain, on an average, when we learn the value of X . An

alternative view is that the entropy of X measures the amount of uncertainty about

X before we learn its value. These two views are complementary; we can either view

entropy as a measure of our uncertainty before we learn the value ofX , or as a measure

of how much information we have gained after we learn the value of X .

Following this one can see that Shannon entropy for the most ‘certain distribu-

tion’ (0, . . . , 1, . . . 0) returns the value 0, and for the most ‘uncertain distribution’

(1/n, . . . , 1/n) returns the value ln n. Further, one can show the inequality

0 ≤ S(p) ≤ ln n ,

for any probability distribution p. The inequality S(p) ≥ 0 is easy to verify. Let us

prove that for any probability distribution p = (p1, . . . , pn) we have

\[
  S(p) = S(p_1, \ldots, p_n) \leq S\!\left( \frac{1}{n}, \ldots, \frac{1}{n} \right) = \ln n . \qquad (1.1)
\]

Here, we shall see the proof. I One way of showing this property is by using the

Jensen inequality for real-valued continuous functions. Let f(x) be a real-valued con-

tinuous concave function defined on the interval [a, b]. Then for any x1, . . . , xn ∈ [a, b]

and any set of non-negative real numbers λ1, . . . , λn such that $\sum_{k=1}^{n} \lambda_k = 1$, we have
\[
  \sum_{k=1}^{n} \lambda_k f(x_k) \leq f\!\left( \sum_{k=1}^{n} \lambda_k x_k \right). \qquad (1.2)
\]

For convex functions the reverse inequality is true. Setting a = 0, b = 1, xk = pk,

$\lambda_k = \frac{1}{n}$ and $f(x) = -x \ln x$, we obtain
\[
  -\sum_{k=1}^{n} \frac{1}{n} p_k \ln p_k \;\leq\; -\left( \sum_{k=1}^{n} \frac{1}{n} p_k \right) \ln\!\left( \sum_{k=1}^{n} \frac{1}{n} p_k \right),
\]


and hence the result.

Alternatively, one can use Lagrange’s method to maximize entropy subject to the

normalization condition of the probability distribution $\sum_{k=1}^{n} p_k = 1$. In this case the Lagrangian is
\[
  L \equiv -\sum_{k=1}^{n} p_k \ln p_k - \lambda \left( \sum_{k=1}^{n} p_k - 1 \right).
\]

Differentiating with respect to p1, . . . , pn, we get

\[
  -(1 + \ln p_k) - \lambda = 0 , \qquad k = 1, \ldots, n,
\]
which gives
\[
  p_1 = p_2 = \cdots = p_n = \frac{1}{n} . \qquad (1.3)
\]

The Hessian matrix is

\[
  \begin{pmatrix}
    -\frac{1}{p_1} & 0 & \cdots & 0 \\
    0 & -\frac{1}{p_2} & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & -\frac{1}{p_n}
  \end{pmatrix}
\]

which is always negative definite, so that the values from (1.3) determine a maximum

value, which, because of the concavity property, is also the global maximum value.

Hence the result. J

1.2.2 Why to maximize entropy?

Consider a random variable X . Let the possible values X takes be x1, . . . , xn that

possibly represent the outcomes of an experiment, states of a physical system, or just

labels of various propositions. The probability with which the event xk is selected is

denoted by pk, for k = 1, . . . , n. Our problem is to assign probabilities p1, . . . , pn.

Laplace’s principle of insufficient reason is the simplest rule that can be used when

we do not have any information about a random experiment. It states that whenever we

have no reason to believe that one case rather than any other is realized, or, as is also

put, in case all values of X are judged to be ‘equally possible’, then their probabilities

are equal, i.e.,
\[
  p_k = \frac{1}{n} , \qquad k = 1, \ldots, n.
\]


We can restate the principle as follows: the uniform distribution is the most satisfactory rep-

resentation of our knowledge when we know nothing about the random variate except

that each probability is nonnegative and the sum of the probabilities is unity. This rule,

of course, refers to the meaning of the concept of probability, and is therefore subject to debate and controversy. We will not discuss this here; one can refer to (Uffink,

1995) for a list of objections to this principle reported in the literature.

Now having the Shannon entropy as a measure of uncertainty (information), can

we generalize the principle of insufficient reason and say that with the available infor-

mation, we can always choose the distribution which maximizes the Shannon entropy?

This is what is known as Jaynes' maximum entropy principle, which states that, of

all the probability distributions that satisfy given constraints, choose the distribution

which maximizes Shannon entropy. That is if our state of knowledge is appropriately

represented by a set of expectation values, then the “best”, least unbiased probability

distribution is the one that (i) reflects just what we know, without “inventing” unavail-

able pieces of knowledge, and, additionally, (ii) maximizes ignorance: the truth, all

the truth, nothing but the truth. This is the rationale behind the maximum entropy

principle.

Now we shall examine this principle in detail. Let us assume that some information

about the random variable X is given which can be modeled as a constraint on the set

of all possible probability distributions. It is assumed that this constraint exhaustively

specifies all relevant information about X . The principle of maximum entropy is then

the prescription to choose that probability distribution p for which the Shannon entropy

is maximal under the given constraint.

Here we take a simple and often studied type of constraint, i.e., the case where

expectation of X is given. Say we have the constraint

\[
  \sum_{k=1}^{n} x_k p_k = U ,
\]

where U is the expectation of X . Now to maximize Shannon entropy with respect

to the above constraint, together with the normalizing constraint $\sum_{k=1}^{n} p_k = 1$, the Lagrangian can be written as
\[
  L \equiv -\sum_{k=1}^{n} p_k \ln p_k - \lambda \left( \sum_{k=1}^{n} p_k - 1 \right) - \beta \left( \sum_{k=1}^{n} x_k p_k - U \right).
\]

Setting the derivatives of the Lagrangian with respect to p1, . . . , pn equal to zero, we get
\[
  \ln p_k = -(1 + \lambda) - \beta x_k .
\]


The Lagrange parameter λ can be specified by the normalizing constraint. Finally,

maximum entropy distribution can be written as
\[
  p_k = \frac{e^{-\beta x_k}}{\sum_{k=1}^{n} e^{-\beta x_k}} ,
\]

where the parameter β is determined by the expectation constraint.
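In practice β has to be found numerically from the expectation constraint. The following sketch (ours; a simple bisection, chosen only for illustration) exploits the fact that the expectation is monotonically decreasing in β.

    import math

    def maxent_distribution(x, beta):
        # p_k proportional to exp(-beta * x_k)
        w = [math.exp(-beta * xk) for xk in x]
        Z = sum(w)
        return [wk / Z for wk in w]

    def expected_value(x, beta):
        return sum(pk * xk for pk, xk in zip(maxent_distribution(x, beta), x))

    def solve_beta(x, U, lo=-50.0, hi=50.0, tol=1e-10):
        # bisection: expected_value decreases as beta increases
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if expected_value(x, mid) > U:
                lo = mid
            else:
                hi = mid
            if hi - lo < tol:
                break
        return 0.5 * (lo + hi)

    x = [1.0, 2.0, 3.0, 4.0]
    beta = solve_beta(x, U=1.7)
    print(beta, expected_value(x, beta))        # expectation close to 1.7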

Note that, one can extend this method to more than one constraint specified with

respect to some arbitrary functions; for details see (Kapur & Kesavan, 1997).

The maximum entropy principle subsumes the principle of insufficient reason. In-

deed, in the absence of reasons, i.e., in the case where none or only trivial constraints

are imposed on the probability distribution, its entropy S(p) is maximal when all prob-

abilities are equal. Of course, as a generalization of the principle of insufficient reason, the maximum entropy principle inherits all the objections associated with its infamous predecessor. Interestingly, it does cope with some of the objections; for details see (Uffink,

1995).

Note that calculating the Lagrange parameters in maximum entropy methods is

a non-trivial task and the same holds for calculating maximum entropy distributions.

Various techniques to calculate maximum entropy distributions can be found in (Ag-

mon et al., 1979; Mead & Papanicolaou, 1984; Ormoneit & White, 1999; Wu, 2003).

Maximum entropy principle can be used for a wide variety of problems. The book

by Kapur and Kesavan (1997) gives an excellent account of maximum entropy methods

with emphasis on various applications.

1.3 A reader’s guide to the thesis

Notation and Delimiters

The commonly used notation in the thesis is given in the beginning of the chapters.

When we write down the proofs of some results which are not specified in the

Theorem/Lemma environment, we denote the beginning and ending of proofs by I

and J respectively. Otherwise the end of proofs that are part of the above are identified

by . Some additional explanations within the results are included in the footnotes.

To avoid proliferation of symbols we use the same notation for different concepts

if this does not cause ambiguity; the correspondence should be clear from the con-

text. For example whether it is a maximum entropy distribution or minimum relative-

entropy distribution we use the same symbols for Lagrange multipliers.


Roadmap

Apart from this chapter this thesis contains five other chapters. We now briefly outline

a summary of each chapter.

In Chapter 2, we present a brief introduction of generalized information measures

and their properties. We discuss how generalized means play a role in the information

measures and present a result related to generalized means and Tsallis generalization.

In Chapter 3, we discuss various aspects of information measures defined on

measure spaces. We present measure-theoretic definitions for generalized information

measures and present important results.

In Chapter 4, we discuss the geometrical aspects of relative-entropy minimization

and present an important result for Tsallis relative-entropy minimization.

In Chapter 5, we apply power-law distributions to selection mechanism in evolu-

tionary algorithms and test their novelty by simulations.

Finally, in Chapter 6, we summarize the contributions of this thesis, and discuss

possible future directions.

18

Page 28: Thesis Entropy Very Impotant

2 KN-averages and Entropies:Renyi’s Recipe

Abstract

This chapter builds the background for this thesis and introduces Renyi and Tsallis(nonextensive) generalizations of classical information measures. It also presentsa significant result on relation between Kolmogorov-Nagumo averages and nonex-tensive generalization, which can also be found in (Dukkipati, Murty, & Bhatnagar,2006b).

In recent years, interest in generalized information measures has increased dramati-

cally, after the introduction of nonextensive entropy in Physics by Tsallis (1988) (first

defined by Havrda and Charvat (1967)), and has been studied extensively in informa-

tion theory and statistics. One can get this nonextensive entropy or Tsallis entropy

by generalizing the information of a single event in the definition of Shannon entropy,

where logarithm is replaced with q-logarithm (defined as lnq x = x1−q−11−q ). The term

‘nonextensive’ is used because it does not satisfy the additivity property – a character-

istic property of Shannon entropy – instead, it satisfies pseudo-additivity of the form

x⊕q y = x+ y + (1 − q)xy.

Indeed, the starting point of the theory of generalized measures of information

is due to Renyi (1960, 1961), who introduced α-entropy or Renyi entropy, the first

formal generalization of Shannon entropy. Replacing linear averaging in Shannon

entropy, which can be interpreted as an average of information of a single event, with

Kolmogorov-Nagumo averages (KN-average) of the form 〈x〉ψ = ψ−1 (∑

k pkψ(xk)),

where ψ is an arbitrary continuous and strictly monotone function), and further impos-

ing the additivity constraint – a characteristic property of underlying information of a

single event – leads to Renyi entropy. Using this recipe of Renyi, one can prepare only

two information measures: Shannon and Renyi entropy. By means of this formalism,

Renyi characterized these additive entropies in terms of axioms of KN-averages.

One can view Renyi’s formalism as a tool, which can be used to generalize in-

formation measures and thereby characterize them using axioms of KN-averages. For

example, one can apply Renyi’s recipe in the nonextensive case by replacing the linear

19

Page 29: Thesis Entropy Very Impotant

averages in Tsallis entropy with KN-averages and thereby imposing the constraint of

pseudo-additivity. A natural question that arises is what are the pseudo-additive infor-

mation measures that one can prepare with this recipe? We prove that Tsallis entropy is

the only possible measure in this case, which allows us to characterize Tsallis entropy

using axioms of KN-averages.

As one can see from the above discussion, Hartley information measure (Hartley,

1928) of a single stochastic event plays a fundamental role in the Renyi and Tsallis

generalizations. Generalization of Renyi involves the generalization of linear average

in Shannon entropy, where as, in the case of Tsallis, it is the generalization of the

Hartley function; while Renyi’s is considered to be the additive generalization, Tsal-

lis is non-additive. These generalizations can be extended to Kullback-Leibler (KL)

relative-entropy too; indeed, many results presented in this thesis are related to gener-

alized relative entropies.

First, we discuss the important properties of classical information measures, Shan-

non and KL, in § 2.1. We discuss Renyi’s generalization in § 2.2, where we discuss

the Hartley function and properties of quasilinear means. Nonextensive generalization

of Shannon entropy and relative-entropy is presented in detail in § 2.3. Results on the

uniqueness of Tsallis entropy under Renyi’s recipe and characterization of nonexten-

sive information measures are presented in § 2.4 and § 2.5 respectively.

2.1 Classical Information Measures

In this section, we discuss the properties of two important classical information mea-

sures, Shannon entropy and Kullback-Leibler relative-entropy. We present the defi-

nitions in the discrete case; the same for the measure-theoretic case are presented in

the Chapter 3, where we discuss the maximum entropy prescriptions of information

measures.

We start with a brief note on the notation used in this chapter. Let X be a discrete

random variable (r.v) defined on some probability space, which takes only n values

and n <∞. We denote the set of all such random variables by X. We use the symbol

Y to denote a different set of random variables, say, those that take only m values

and m 6= n, m < ∞. Corresponding to the n-tuple (x1, . . . , xn) of values which X

takes, the probability mass function (pmf) of X is denoted by p = (p1, . . . pn), where

pk ≥ 0, k = 1, . . . n and∑n

k=1 pk = 1. Expectation of the r.v X is denoted by EX or

〈X〉; we use both the notations, interchangeably.

20

Page 30: Thesis Entropy Very Impotant

2.1.1 Shannon Entropy

Shannon entropy, a logarithmic measure of information of an r.v X ∈ X denoted by

S(X), reads as (Shannon, 1948)

S(X) = −n∑

k=1

pk ln pk . (2.1)

The convention that 0 ln 0 = 0 is followed, which can be justified by the fact that

limx→0 x lnx = 0. This formula was discovered independently by (Wiener, 1948),

hence, it is also known as Shannon-Wiener entropy.

Note that the entropy functional (2.1) is determined completely by the pmf p of r.v

X , and does not depend on the actual values that X takes. Hence, entropy functional

is often denoted as a function of pmf alone as S(p) or S(p1, . . . , pn); we use all these

notations, interchangeably, depending on the context. The logarithmic function in (2.1)

can be taken with respect to an arbitrary base greater than unity. In this thesis, we

always use the base e unless otherwise mentioned.

Shannon entropy of the Bernoulli variate is known as Shannon entropy function

which is defined as follows. Let X be a Bernoulli variate with pmf (p, 1 − p) where

0 < p < 1. Shannon entropy of X or Shannon entropy function is defined as

s(p) = S(p, 1 − p) = −p ln p− (1 − p) ln(1 − p) , p ∈ [0, 1] . (2.2)

s(p) attains its maximum value for p = 12 . Later, in this chapter we use this function

to compare Shannon entropy functional with generalized information measures, Renyi

and Tsallis, graphically.

Also, Shannon entropy function is of basic importance as Shannon entropy can be

expressed through it as follows:

S(p1, . . . , pn)= (p1 + p2)s

(p2

p1 + p2

)+ (p1 + p2 + p3)s

(p3

p1 + p2 + p3

)

+ . . .+ (p1 + . . .+ pn)s

(pn

p1 + . . .+ pn

)

=

n∑

k=2

(p1 + . . .+ pk)s

(pk

p1 + . . .+ pk

). (2.3)

We have already discussed some of the basic properties of Shannon entropy in

Chapter 1; here we state some properties formally. For a detailed list of properties

see (Aczel & Daroczy, 1975; Guiasu, 1977; Cover & Thomas, 1991; Topsøe, 2001).

21

Page 31: Thesis Entropy Very Impotant

S(p) ≥ 0, for any pmf p = (p1, . . . , pn) and assumes minimum value, S(p) = 0,

for a degenerate distribution, i.e., p(x0) = 1 for some x0 ∈ X , and p(x) = 0, ∀x ∈ X ,

x 6= x0. If p is not degenerate then S(p) is strictly positive. For any probability

distribution p = (p1, . . . , pn) we have

S(p) = S(p1, . . . , pn) ≤ S

(1

n, . . . ,

1

n

)= lnn . (2.4)

An important property of entropy functional S(p) is that it is a concave function of

p. This is a very useful property since a local maximum is also the global maximum

for a concave function that is subject to linear constraints.

Finally, the characteristic property of Shannon entropy can be stated as follows.

Let X ∈ X and Y ∈ Y be two random variables which are independent. Then we

have,

S(X × Y ) = S(X) + S(Y ) , (2.5)

where X × Y denotes joint r.v of X and Y . When X and Y are not necessarily

independent, then1

S(X × Y ) ≤ S(X) + S(Y ) , (2.6)

i.e., the entropy of the joint experiment is less than or equal to the sum of the uncer-

tainties of the two experiments. This is called the subadditivity property.

Many sets of axioms for Shannon entropy have been proposed. Shannon (1948)

has originally given a characterization theorem of the entropy introduced by him. A

more general and exact one is due to Hincin (1953), generalized by Faddeev (1986).

The most intuitive and compact axioms are given by Khinchin (1956), which are

known as the Shannon-Khinchin axioms. Faddeev’s axioms can be obtained as a spe-

cial case of Shannon-Khinchin axioms cf. (Guiasu, 1977, pp. 9, 63).

Here we list the Shannon-Khinchin axioms. Consider the sequence of functions

S(1), S(p1, p2), . . . , S(p1, . . . pn), . . ., where, for every n, the function S(p1, . . . , pn)

is defined on the set

P =

(p1, . . . , pn) | pi ≥ 0,

n∑

i=1

pi = 1

.

1This follows from the fact that S(X ×Y ) = S(X) +S(Y |X), and conditional entropy S(Y |X) ≤S(Y ), where

S(Y |X) = −ni=1

mj=1

p(xi, yj) ln p(yj |xi) .

22

Page 32: Thesis Entropy Very Impotant

Consider the following axioms:

[SK1] continuity: For any n, the function S(p1, . . . , pn) is continuous and symmetric

with respect to all its arguments,

[SK2] expandability: For every n, we have

S(p1, . . . , pn, 0) = S(p1, . . . , pn) ,

[SK3] maximality: For every n, we have the inequality

S(p1, . . . , pn) ≤ S

(1

n, . . . ,

1

n

),

[SK4] Shannon additivity: If

pij ≥ 0, pi =

mi∑

j=1

pij ∀i = 1, . . . , n, ∀j = 1, . . . ,mi, (2.7)

then the following equality holds:

S(p11, . . . , pnmn) = S(p1, . . . , pn) +

n∑

i=1

piS

(pi1

pi, . . . ,

pimipi

). (2.8)

Khinchin uniqueness theorem states that if the functional S : P → R satisfies the

axioms [SK1]-[SK4] then S is uniquely determined by

S(p1, . . . , pn) = −cn∑

k=1

pk ln pk ,

where c is any positive constant. Proof of this uniqueness theorem for Shannon entropy

can be found in (Khinchin, 1956) or in (Guiasu, 1977, Theorem 1.1, pp. 9).

2.1.2 Kullback-Leibler Relative-Entropy

Kullback and Leibler (1951) introduced relative-entropy or information divergence,

which measures the distance between two distributions of a random variable. This in-

formation measure is also known as KL-entropy, cross-entropy, I-divergence, directed

divergence, etc. (We use KL-entropy and relative-entropy interchangeably in this the-

sis.) KL-entropy of X ∈ X with pmf p with respect to Y ∈ X with pmf r is denoted

by I(X‖Y ) and is defined as

I(p‖r) = I(X‖Y ) =n∑

k=1

pk lnpk

rk, (2.9)

23

Page 33: Thesis Entropy Very Impotant

where one would assume that whenever rk = 0, the corresponding pk = 0 and 0 ln 00 =

0. Following Renyi (1961), if p and r are pmfs of the same r.v X , the relative-entropy

is sometimes synonymously referred to as the information gain about X achieved if p

can be used instead of r. KL-entropy as a distance measure on the space of all pmfs

of X is not a metric, since it is not symmetric, i.e., I(p‖r) 6= I(r‖p), and it does not

satisfy the triangle inequality.

KL-entropy is an important concept in information theory, since other information-

theoretic quantities including entropy and mutual information may be formulated as

special cases. For continuous distributions in particular, it overcomes the difficulties

with continuous version of entropy (known as differential entropy); its definition in

nondiscrete cases is a natural extension of the discrete case. These aspects constitute

the major discussion of Chapter 3 of this thesis.

Among the properties of KL-entropy, the property that I(p‖r) ≥ 0 and I(p‖r) = 0

if and only if p = r is fundamental in the theory of information measures, and is known

as the Gibbs inequality or divergence inequality (Cover & Thomas, 1991, pp. 26). This

property follows from Jensen’s inequality.

I(p‖r) is a convex function of both p and r. Further, it is a convex in the pair

(p, r), i.e., if (p1, r1) and (p2, q2) are two pairs of pmfs, then (Cover & Thomas, 1991,

pp. 30)

I(λp1 + (1 − λ)p2‖λr1 + (1 − λ)r2) ≤ λI(p1‖r1) + (1 − λ)I(p2‖r2) .(2.10)

Similar to Shannon entropy, KL-entropy is additive too in the following sense. Let

X1, X2 ∈ X and Y1, Y2 ∈ Y be such that X1 and Y1 are independent, and X2 and Y2

are independent, respectively, then

I(X1 × Y1‖X2 × Y2) = I(X1‖X2) + I(Y1‖Y2) , (2.11)

which is the additivity property2 of KL-entropy.

Finally, KL-entropy (2.9) and Shannon entropy (2.1) are related by

I(p‖r) = −S(p) −n∑

k=1

pk ln rk . (2.12)

2Additivity property of KL-entropy can alternatively be stated as follows. Let X and Y be twoindependent random variables. Let p(x, y) and r(x, y) be two possible joint pmfs of X and Y . Then wehave

I(p(x, y)‖r(x, y)) = I(p(x)‖r(x)) + I(p(y)‖r(y)) .

24

Page 34: Thesis Entropy Very Impotant

One has to note that the above relation between KL and Shannon entropies differs in

the nondiscrete cases, which we discuss in detail in Chapter 3.

2.2 Renyi’s Generalizations

Two important concepts that are essential for the derivation of Renyi entropy are Hart-

ley information measure and generalized averages known as Kolmogorov-Nagumo

averages. Hartley information measure quantifies the information associated with a

single event and brings forth the operational significance of the Shannon entropy – the

average of Hartley information is viewed as the Shannon entropy. Renyi used gener-

alized averages KN, in the averaging of Hartley information to derive his generalized

entropy. Before we summarize the information theory procedure leading to Renyi en-

tropy, we discuss these concepts in detail.

A conceptual discussion on significance of Hartley information in the definition

of Shannon entropy can be found in (Renyi, 1960) and more formal discussion can

be found in (Aczel & Daroczy, 1975, Chapter 0). Concepts related to generalized

averages can be found in the book on inequalities (Hardy et al., 1934, Chapter 3).

2.2.1 Hartley Function and Shannon Entropy

The motivation to quantify information in terms of logarithmic functions goes back

to Hartley (1928), who first used a logarithmic function to define uncertainty associ-

ated with a finite set. This is known as Hartley information measure. The Hartley

information measure of a finite set A with n elements is defined as H(A) = logb n. If

the base of the logarithm is 2, then the uncertainty is measured in bits, and in the case

of natural logarithm, the unit is nats. As we mentioned earlier, in this thesis, we use

only natural logarithm as a convention.

Hartley information measure resembles the measure of disorder in thermodynam-

ics, first provided by Boltzmann principle (known as Boltzmann entropy), and is given

by

S = K lnW , (2.13)

where K is the thermodynamic unit of measurement of entropy and is known as the

Boltzmann constant and W , called the degree of disorder or statistical weight, is the

total number of microscopic states compatible with the macroscopic state of the sys-

tem.

25

Page 35: Thesis Entropy Very Impotant

One can give a more general definition of Hartley information measure described

above as follows. Define a function H : x1, . . . , xn → R of the values taken by r.v

X ∈ X with corresponding p.m.f p = (p1, . . . pn) as (Aczel & Daroczy, 1975)

H(xk) = ln1

pk, ∀k = 1, . . . n. (2.14)

H is also known as information content or entropy of a single event (Aczel & Daroczy,

1975) and plays an important role in all classical measures of information. It can be

interpreted either as a measure of how unexpected the given event is, or as measure

of the information yielded by the event; and it has been called surprise by Watanabe

(1969), and unexpectedness by Barlow (1990).

Hartley function satisfies: (i) H is nonnegative: H(xk) ≥ 0 (ii) H is additive:

H(xi, xj) = H(xi) + H(xj), where H(xi, xj) = ln 1pipj

(iii) H is normalized:

H(xk) = 1, whenever pk = 1e (in the case of logarithm with base 2, the same is

satisfied for pk = 12 ). These properties are both necessary and sufficient (Aczel &

Daroczy, 1975, Theorem 0.2.5).

Now, Shannon entropy (2.1) can be written as expectation of Hartley function as

S(X) = 〈H〉 =

n∑

k=1

pkHk , (2.15)

where Hk = H(xk), ∀k = 1, . . . n, with the understanding that 〈H〉 = 〈H(X)〉.The characteristic additive property of Shannon entropy (2.5) now follows as a conse-

quence of the additivity property of Hartley function.

There are two postulates involved in defining Shannon entropy as expectation of

Hartley function. One is the additivity of information which is the characteristic prop-

erty of Hartley function, and the other is that if different amounts of information occur

with different probabilities, the total information will be the average of the individual

informations weighted by the probabilities of their occurrences. One can justify these

postulates by heuristic arguments based on probabilistic considerations, which can be

advanced to establish the logarithmic nature of Hartley and Shannon information mea-

sures (see § 1.2.1).

Expressing or defining Shannon entropy as an expectation of Hartley function, not

only provides an intuitive idea of Shannon entropy as a measure of information but it is

also useful in derivation of its properties. Further, as we are going to see in detail, this

provides a unified way to discuss the Renyi’s and Tsallis generalizations of Shannon

entropy.

Now we move on to a discussion on generalized averages.

26

Page 36: Thesis Entropy Very Impotant

2.2.2 Kolmogorov-Nagumo Averages or Quasilinear Means

In the general theory of means, the quasilinear mean of a random variable X ∈ X is

defined as3

EψX = 〈X〉ψ = ψ−1

(n∑

k=1

pkψ (xk)

), (2.16)

where ψ is continuous and strictly monotonic (increasing or decreasing) and hence

has an inverse ψ−1, which satisfies the same conditions. In the context of gener-

alized means, ψ is referred to as Kolmogorov-Nagumo function (KN-function). In

particular, if ψ is linear, then (2.16) reduces to the expression of linear averaging,

EX = 〈X〉 =∑n

k=1 pkxk. Also, the mean 〈X〉ψ takes the form of weighted arith-

metic mean (∑n

k=1 pkxak)

1a when ψ(x) = xa, a > 0 and geometric mean

∏nk=1 x

pkk if

ψ(x) = lnx.

In order to justify (2.16) as a so called mean we need the following theorem.

THEOREM 2.1 If ψ is continuous and strictly monotone in a ≤ x ≤ b, a ≤ xk ≤ b, k = 1, . . . n,

pk > 0 and∑n

k=1 pk = 1, then ∃ unique x0 ∈ (a, b) such that

ψ(x0) =n∑

k=1

pkψ(xk) , (2.17)

and x0 is greater than some and less than others of the xk unless all xk are zero.

The implication of Theorem 2.1 is that the mean 〈 . 〉ψ is determined when the

function ψ is given. One may ask whether the converse is true: if 〈X〉ψ1= 〈X〉ψ2

for

all X ∈ X, is ψ1 necessarily the same function as ψ2? Before answering this question,

we shall give the following definition.

DEFINITION 2.1 Continuous and strictly monotone functions ψ1 and ψ2 are said to be KN-equivalent

if 〈X〉ψ1= 〈X〉ψ2

for all X ∈ X.

3Kolmogorov (1930) and Nagumo (1930) first characterized the quasilinear mean for a vector(x1, . . . , xn) as 〈x〉

ψ= ψ−1 n

k=11nψ(xk) where ψ is a continuous and strictly monotone function.

de Finetti (1931) extended their result to the case of simple (finite) probability distributions. The versionof the quasilinear mean representation theorem referred to in § 2.5 is due to Hardy et al. (1934), whichfollowed closely the approach of de Finetti. Aczel (1948) proved a characterization of the quasilinearmean using functional equations. Ben-Tal (1977) showed that quasilinear means are ordinary arithmeticmeans under suitably defined addition and scalar multiplication operations. Norries (1976) did a surveyof quasilinear means and its more restrictive forms in Statistics, and a more recent survey of general-ized means can be found in (Ostasiewicz & Ostasiewicz, 2000). Applications of quasilinear means canbe found in economics (e.g., Epstein & Zin, 1989) and decision theory (e.g., Kreps & Porteus, 1978)).Recently Czachor and Naudts (2002) studied generalized thermostatistics based on quasilinear means.

27

Page 37: Thesis Entropy Very Impotant

Note that when we compare two means, it is to be understood that the underlying prob-

abilities are same. Now, the following theorem characterizes KN-equivalent functions.

THEOREM 2.2 In order that two continuous and strictly monotone functions ψ1 and ψ2 are KN-

equivalent, it is necessary and sufficient that

ψ1 = αψ2 + β , (2.18)

where α and β are constants and α 6= 0.

A simple consequence of the above theorem is that if ψ is a KN-function then we

have 〈X〉ψ = 〈X〉−ψ. Hence, without loss of generality, one can assume that ψ is

an increasing function. The following theorem states the important property of KN-

averages, which characterizes additivity of quasilinear means cf. (Hardy et al., 1934,

Theorem 84).

THEOREM 2.3 Let ψ be a KN-function and c be a real constant then 〈X + c〉ψ = 〈X〉ψ + c i.e.,

ψ−1

(n∑

k=1

pkψ (xk + c)

)= ψ−1

(n∑

k=1

pkψ (xk)

)+ c

if and only if ψ is either linear or exponential.

Proofs of Theorems 2.1, 2.2 and 2.3 can be found in the book on inequalities

by Hardy et al. (1934).

Renyi (1960) employed these generalized averages in the definition of Shannon

entropy to generalize the same.

2.2.3 Renyi Entropy

In the definition of Shannon entropy (2.15), if the standard mean of Hartley function

H is replaced with the quasilinear mean (2.16), one can obtain a generalized measure

of information of r.v X with respect to a KN-function ψ as

Sψ(X) = ψ−1

(n∑

k=1

pkψ

(ln

1

pk

))= ψ−1

(n∑

k=1

pkψ (Hk)

), (2.19)

where ψ is a KN-function. We refer to (2.19) as quasilinear entropy with respect to

the KN-function ψ. A natural question that arises is what is the possible mathematical

form of KN-function ψ, or in other words, what is the most general class of functions

ψ which will still provide a measure of information compatible with the additivity

28

Page 38: Thesis Entropy Very Impotant

property (postulate)? The answer is that insisting on additivity allows by Theorem 2.3

only for two classes of ψ’s – linear and exponential functions. We formulate these

arguments formally as follows.

If we impose the constraint of additivity on Sψ, i.e., for any X,Y ∈ X

Sψ(X × Y ) = Sψ(X) + Sψ(Y ) , (2.20)

then ψ should satisfy (Renyi, 1960)

〈X + c〉ψ = 〈X〉ψ + c , (2.21)

for any random variable X ∈ X and a constant c.

Renyi employed this formalism to define a one-parameter family of measures of

information as follows:

Sα(X) =1

1 − αln

(n∑

k=1

pαk

), (2.22)

where the KN-function ψ is chosen in (2.19) as ψ(x) = e(1−α)x whose choice is moti-

vated by Theorem 2.3. If we choose ψ as a linear function in quasilinear entropy (2.19),

what we get is Shannon entropy. The right side of (2.22) makes sense4 as a measure

of information whenever α 6= 1 and α > 0 cf. (Renyi, 1960).

Renyi entropy is a one-parameter generalization of Shannon entropy in the sense

that Sα(p) → S(p) as α → 1. Hence, Renyi entropy is referred to as entropy of order

α, whereas Shannon entropy is referred to as entropy of order 1. The Renyi entropy

can also be seen as an interpolation formula connecting the Shannon (α = 1) and

Hartley (α = 0) entropies.

Among the basic properties of Renyi entropy, Sα is positive. This follows from

Jensen’s inequality which gives∑n

k=1 pαk ≤ 1 in the case α > 1, and while in the case

0 < α < 1 it gives∑n

k=1 pαk ≥ 1; in both cases we have Sα(p) ≥ 0.

Sα is strictly concave with respect to p for 0 < α ≤ 1. For α > 1, Renyi

entropy is neither pure convex nor pure concave. This is a simple consequence of

the fact that both lnx and xα (α < 1) are concave functions, while xα is convex for

α > 1 (see (Ben-Bassat & Raviv, 1978) for proofs and a detailed discussion).4For negative α, however, Sα(p) has disadvantageous properties; namely, it will tend to infinity if

any pk tends to 0. This means that it is too sensitive to small probabilities. (This property could alsoformulated in the following way: if we add a new event of probability 0 to a probability distribution,what does not change the probability distribution, Sα(p) becomes infinity.) The case α = 0 must also beexcluded because it yields an expression not depending on the probability distribution p = (p1, . . . , pn).

29

Page 39: Thesis Entropy Very Impotant

A notable property of Sα(p) is that it is a monotonically decreasing function of α

for any pmf p. This can be verified as follows. I We can calculate the derivative of

Sα(p) with respect to α as

dSα(p)

dα=

1

(1 − α)

n∑

k=1

(pαk∑nj=1 p

αj

)ln pk +

1

(1 − α)2ln

n∑

k=1

pαk

=1

(1 − α)2

n∑

k=1

(pαk∑nj=1 p

αj

)ln p1−α

k − lnn∑

k=1

(pαk∑nj=1 p

αj

)p1−αk

.

(2.23)

One should note here that the vector of positive real numbers(

pα1 nj=1 p

αj, . . . ,

pαn nj=1 p

αj

)

represents a pmf. (Indeed, distributions of this form are known as escort distribu-

tions (Abe, 2003) and plays an important role in ME-prescriptions of Tsallis en-

tropy. We discuss these aspects in Chapter 3.) Denoting the mean of a vector x =

(x1, . . . , xn) with respect to this pmf, i.e. escort distribution of p, by 〈〈x〉〉α we can

write (2.23) in an elegant form, which further gives the results as

dSα(p)

dα=

1

(1 − α)2

〈〈ln p1−α〉〉α − ln 〈〈p1−α〉〉α

≤ 0 . (2.24)

The inequality in (2.24) is due to Jensen’s inequality. J Important consequences of the

fact that Sα is a monotone decreasing function of α are the following two inequalities

S1(p) < Sα(p) < lnn , 0 < α < 1, (2.25a)

Sα(p) < S1(p) < lnn , α > 1, (2.25b)

where S1(p) = limα→1 Sα(p) is the Shannon entropy.

From the derivation of Renyi entropy it is obvious that it is additive, i.e.,

Sα(X × Y ) = Sα(X) + Sα(Y ) , (2.26)

where X ∈ X and Y ∈ Y are two independent r.v.

Most of the other known properties of Renyi entropy and its characterizations are

summarized by Aczel and Daroczy (1975, Chapter 5) and Jizba and Arimitsu (2004b).

Properties related to convexity and bounds of Renyi entropy can be found in (Ben-

Bassat & Raviv, 1978).

30

Page 40: Thesis Entropy Very Impotant

Similar to the Shannon entropy function (2.2) one can define the entropy function

in the case of Renyi as

sα(p) =1

1 − αln(pα + (1 − p)α

), p ∈ [0, 1], (2.27)

which is the Renyi entropy of a Bernoulli random variable.

Figure 2.1 shows the plot of Shannon entropy function (2.2) compared to Renyi

entropy function (2.27) for various values of entropic index α.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

p

s(p

) &

sα(

p)

α=0.8

α=1.2

α=1.5

Shannon Renyi

Figure 2.1: Shannon and Renyi Entropy Functions

Renyi entropy does have a reasonable operational significance even if not one com-

parable with that of Shannon entropy cf. (Csiszar, 1974). As regards the axiomatic ap-

proach, Renyi (1961) did suggest a set of postulates characterizing his entropies but it

involved the rather artificial procedure of considering incomplete pdfs (∑n

k=1 pk ≤ 1 )

as well. This shortcoming has been eliminated by Daroczy (1970). Recently, a slightly

different set of axioms is given by (Jizba & Arimitsu, 2004b).

Despite its formal origin, Renyi entropy proved important in a variety of practical

applications in coding theory (Campbell, 1965; Aczel & Daroczy, 1975; Lavenda,

1998), statistical inference (Arimitsu & Arimitsu, 2000, 2001), quantum mechan-

ics (Maassen & Uffink, 1988), chaotic dynamics systems (Halsey, Jensen, Kadanoff,

Procaccia, & Shraiman, 1986) etc. Renyi entropy is also used in neural networks

(Kamimura, 1998). Thermodynamic properties of systems with multi-fractal struc-

tures have been studied by extending the notion of Gibbs-Shannon entropy into a more

31

Page 41: Thesis Entropy Very Impotant

general framework - Renyi entropy (Jizba & Arimitsu, 2004a).

Entropy of order 2 i.e., Renyi entropy for α = 2,

S2(p) = − ln

n∑

k=1

p2k (2.28)

is known as Renyi quadratic entropy. R‘enyi quadratic entropy is mostly used in

a contex of kernel based estimators, since it allows an explicit computation of the

estimated density. This measure has also been applied to clustering problems under

the name of information theoretic clustering (Gokcay & Principe, 2002). Maximum

entropy formulations of Renyi quadratic entropy are studied to compute conditional

probabilities, with applications to image retrieval and language modeling in the PhD

thesis of Zitnick (2003).

Along similar lines of generalization of entropy, Renyi (1960) defined a one pa-

rameter generalization of Kullback-Leibler relative-entropy as

Iα(p‖r) =1

α− 1ln

n∑

k=1

pαk

rα−1k

(2.29)

for pmfs p and r. Properties of this generalized relative-entropy can be found in (Renyi,

1970, Chapter 9).

We conclude this section with the note that though it is considered that the first

formal generalized measure of information is due to Renyi, the idea of considering

some generalized measure did not start with Renyi. Bhattacharyya (1943, 1946) and

Jeffreys (1948) dealt with the quantity

I1/2(p‖r) = −2

n∑

k=1

√pkrk = I1/2(r‖p) (2.30)

as a measure of difference between the distributions p and r, which is nothing but

Renyi relative-entropy (2.29) with α = 12 . Before Renyi, Schutzenberger (1954) men-

tioned the expression Sα and Kullback (1959) too dealt with the quantities Iα. (One

can refer (Renyi, 1960) for a discussion on the context in which Kullback considered

these generalized entropies.)

Apart from Renyi and Tsallis generalizations, there are various generalizations

of Shannon entropy reported in literature. Reviews of these generalizations can be

found in Kapur (1994) and Arndt (2001). The characterizations of various information

measures are studied in (Ebanks, Sahoo, & Sander, 1998). Since poorly motivated

generalizations have also been published during Renyi’s time, Renyi emphasized the

need of operational as well as postulational justification in order to call an algebraic

32

Page 42: Thesis Entropy Very Impotant

expression an information quantity. In this respect, Renyi’s review paper (Renyi, 1965)

is particularly instructive.

Now we discuss the important, non-additive generalization of Shannon entropy.

2.3 Nonextensive Generalizations

Although, first introduced by Havrda and Charvat (1967) in the context of cybernetics

theory and later studied by Daroczy (1970), it was Tsallis (1988) who exploited its

nonextensive features and placed it in a physical setting. Hence it is also known as

Harvda-Charvat-Daroczy-Tsallis entropy. (Throughout this paper we refer to this as

Tsallis or nonextensive entropy.)

2.3.1 Tsallis Entropy

Tsallis entropy of an r.v X ∈ X with p.m.f p = (p1, . . . pn) is defined as

Sq(X) =1 −∑n

k=1 pqk

q − 1, (2.31)

where q > 0 is called the nonextensive index.

Tsallis entropy too, like Renyi entropy, is a one-parameter generalization of Shan-

non entropy in the sense that

limq→1

Sq(p) = −n∑

k=1

pk ln pk = S1(p) , (2.32)

since in the limit q → 1, we have pq−1k = e(q−1) ln pk ∼ 1 + (q − 1) ln pk or by the

L’Hospital rule.

Tsallis entropy retains many important properties of Shannon entropy except for

the additivity property. Here we briefly discuss some of these properties. The argu-

ments which provide the positivity of Renyi entropy are also applicable for Tsallis

entropy and hence Sq(p) ≥ 0 for any pmf p. Sq equals zero in the case of certainty

and attains its extremum for a uniform distribution.

The fact that Tsallis entropy attains maximum for uniform distribution can be

shown as follows. I We extremize the Tsallis entropy under the normalizing con-

straint∑n

k=1 pk = 1. By introducing the Lagrange multiplier λ, we set

0 =∂

∂pk

(1 −∑n

k=1 pqk

q − 1− λ

(n∑

k=1

pk − 1

))= − q

q − 1pq−1k − λ .

33

Page 43: Thesis Entropy Very Impotant

It follows that

pk =

[λ(1 − q)

q

] 1q−1

.

Since this is independent of k, imposition of the normalizing constraint immediately

yields pk = 1n . J

Tsallis entropy is concave for all q > 0 (convex for q < 0). I This follows

immediately from the Hessian matrix

∂2

∂pi∂pj

(Sq(p) − λ

(n∑

k=1

pk − 1

))= −qpq−2

i δij ,

which is clearly negative definite for q > 0 (positive definite for q < 0). J One can

recall that Renyi entropy (2.22) is concave only for 0 < α < 1.

Also, one can prove that for two pmfs p and r, and for real number 0 ≤ λ ≤ 1 we

have

Sq(λp+ (1 − λ)r) ≥ λSq(p) + (1 − λ)Sq(r) , (2.33)

which results from Jensen’s inequality and concavity of xq

1−q .

What separates out Tsallis entropy from Shannon and Renyi entropies is that it is

not additive. The entropy index q in (2.31) characterizes the degree of nonextensivity

reflected in the pseudo-additivity property

Sq(X×Y ) = Sq(X)⊕qSq(Y ) = Sq(X)+Sq(Y )+(1−q)Sq(X)Sq(Y ) ,(2.34)

where X,Y ∈ X are two independent random variables.

In the nonextensive case, Tsallis entropy function can written as

sq(p) =1

q − 1

(1 − xq − (1 − x)q

)(2.35)

Figure 2.2 shows the plots of Shannon entropy function (2.2) and Tsallis entropy func-

tion (2.35) for various values of entropic index a.

It is worth mentioning here that the derivation of Tsallis entropy using the Lorentz

addition by Amblard and Vignat (2005) gives insights into the boundedness of Tsallis

entropy. In this thesis we will not go into these details.

The first set of axioms for Tsallis entropy is given by dos Santos (1997), which

were later improved by Abe (2000). The most concise set of axioms are given by Su-

yari (2004a), which are known as Generalized Shannon-Khinchin axioms. A simpli-

34

Page 44: Thesis Entropy Very Impotant

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

p

s(p)

& s

q(p)

q=0.8

q=1.2

q=1.5

Shannon Tsallis

Figure 2.2: Shannon and Tsallis Entropy Functions

fied proof of this uniqueness theorem for Tsallis entropy is given by (Furuichi, 2005).

In these axioms, Shannon additivity (2.8) is generalized to

Sq(p11, . . . , pnmn) = Sq(p1, . . . , pn) +n∑

i=1

pqiSq

(pi1

pi, . . . ,

pimipi

), (2.36)

under the same conditions (2.7); remaining axioms are the same as in Shannon-Khinchin

axioms.

Now we turn our attention to the nonextensive generalization of relative-entropy.

The definition of Kullback-Leibler relative-entropy (2.9) and the nonextensive entropic

functional (2.31) naturally lead to the generalization (Tsallis, 1998)

Iq(p‖r) =

n∑

k=1

pk

[pk

rk

]q−1

− 1

q − 1, (2.37)

which is called as Tsallis relative-entropy. The limit q → 1 recovers the relative-

entropy in the classical case. One can also generalize Gibbs inequality as (Tsallis,

1998) )

Iq(p‖r) ≥ 0 if q > 0

= 0 if q = 0

≤ 0 if q < 0 . (2.38)

35

Page 45: Thesis Entropy Very Impotant

For q 6= 0, the equalities hold if and only if p = r. (2.38) can be verified as follows.

I Consider the function f(x) = 1−x1−q

1−q . We have f ′′(x) > 0 for q > 0 and hence it

is convex. By Jensen’s inequality we obtain

Iq(p‖r) =n∑

k=1

pk

1 −(rkpk

)1−q

1 − q

≥ 1

1 − q

1 −

(n∑

k=1

pkrk

pk

)1−q

= 0 . (2.39)

For q < 0 we have f ′′(x) < 0 and hence we have the reverse inequality by Jensen’s

inequality for concave functions. J

Further, for q > 0, Iq(p‖r) is a convex function of p and r, and for q < 0 it

is concave, which can be proved using Jensen’s inequality cf. (Borland, Plastino, &

Tsallis, 1998).

Tsallis relative-entropy satisfies the pseudo-additivity property of the form (Fu-

ruichi et al., 2004)

Iq(X1 × Y1‖X2 × Y2) = Iq(X1‖X2) + Iq(Y1‖Y2)

+(q − 1)Iq(X1‖X2)Iq(Y1‖Y2) , (2.40)

where X1, X2 ∈ X and Y1, Y2 ∈ Y are such that X1 and Y1 are independent, and X2

and Y2 are independent respectively. The limit q → 1 in (2.40) retrieves (2.11), the ad-

ditivity property of Kullback-Leibler relative-entropy. One should note the difference

between the pseudo-additivities of Tsallis entropy (2.34) and Tsallis relative-entropy

(2.40).

Further properties of Tsallis relative-entropy have been discussed in (Tsallis, 1998;

Borland et al., 1998; Furuichi et al., 2004). Characterization of Tsallis relative-entropy,

by generalizing Hobson’s uniqueness theorem (Hobson, 1969) of relative-entropy, is

presented in (Furuichi, 2005).

2.3.2 q-Deformed Algebra

The mathematical basis for Tsallis statistics comes from the q-deformed expressions

for the logarithm (q-logarithm) and the exponential function (q-exponential) which

were first defined in (Tsallis, 1994), in the context of nonextensive thermostatistics.

The q-logarithm is defined as

lnq x =x1−q − 1

1 − q(x > 0, q ∈ R) , (2.41)

36

Page 46: Thesis Entropy Very Impotant

and the q-exponential is defined as

exq =

[1 + (1 − q)x]

11−q if 1 + (1 − q)x ≥ 0

0 otherwise.(2.42)

We have limq→1 lnq x = lnx and limq→1 exq = ex. These two functions are related by

elnq xq = x . (2.43)

The q-logarithm satisfies pseudo-additivity of the form

lnq(xy) = lnq x+ lnq y + (1 − q) lnq x lnq y , (2.44)

while, the q-exponential satisfies

exqeyq = e(x+y+(1−q)xy)

q . (2.45)

One important property of the q-logarithm is (Furuichi, 2006)

lnq

(x

y

)= yq−1(lnq x− lnq y) . (2.46)

These properties of q-logarithm and q-exponential functions, (2.44) and (2.45),

motivate the definition of q-addition as

x⊕q y = x+ y + (1 − q)xy , (2.47)

which we have already mentioned in the context of pseudo-additivity of Tsallis entropy

(2.34). The q-addition is commutative i.e., x ⊕q y = y ⊕q x, and associative i.e.,

x ⊕q (y ⊕q z) = (x ⊕q y) ⊕q z. But it is not distributive with respect to the usual

multiplication, i.e., a(x ⊕q y) 6= (ax ⊕q ay). Similar to the definition of q-addition,

the q-difference is defined as

xq y =x− y

1 + (1 − q)y, y 6= 1

q − 1. (2.48)

Further properties of these q-deformed functions can be found in (Yamano, 2002).

In this framework a new multiplication operation called q-product has been de-

fined, which plays an important role in the compact representation of distributions

resulting from Tsallis relative-entropy minimization (Dukkipati, Murty, & Bhatnagar,

2005b). These aspects are discussed in Chapter 4.

Now, using these q-deformed functions, Tsallis entropy (2.31) can be represented

as

Sq(p) = −n∑

k=1

pqk lnq pk , (2.49)

37

Page 47: Thesis Entropy Very Impotant

and Tsallis relative-entropy (2.37) as

Iq(p‖r) = −n∑

k=1

pk lnqrk

pk. (2.50)

These representations are very important for deriving many results related to nonex-

tensive generalizations as we are going to consider in the later chapters.

2.4 Uniqueness of Tsallis Entropy under Renyi’s Recipe

Though the derivation of Tsallis entropy proposed in 1988 is slightly different, one

can understand this generalization using the q-logarithm function, where one would

first generalize logarithm in the Hartley information with the q-logarithm and define

the q-Hartley function H : x1, . . . , xn → R of r.v X as (Tsallis, 1999)

Hk = H(xk) = lnq1

pk, k = 1, . . . n . (2.51)

Now, Tsallis entropy (2.31) can be defined as the expectation of the q-Hartley function

H as5

Sq(X) =⟨H⟩. (2.52)

Note that the characteristic pseudo-additivity property of Tsallis entropy (2.34) is a

consequence of the pseudo-additivity of the q-logarithm (2.44).

Before we present the main results, we briefly discuss the context of quasilinear

means, where there is a relation between Tsallis and Renyi entropy. By using the

definition of the q-logarithm (2.41), the q-Hartley function can be written as

Hk = lnq1

pk= φq(Hk) ,

where

φq(x) =e(1−q)x − 1

1 − q= lnq(e

x) . (2.53)

Note that the function φq is KN-equivalent to e(1−q)x (by Theorem 2.2), the KN-

function used in Renyi entropy. Hence Tsallis entropy is related to Renyi entropies

as

STq = φq(S

Rq ) , (2.54)

5There are alternative definitions of nonextensive information content in the Tsallis formalism. Oneof them is the expression − lnq pk used by Yamano (2001) and characterized by Suyari (2002) (notethat − lnq pk 6= lnq

1pk

). Using this definition one has to use alternate expectation, called q-expectation,to define Tsallis entropy. We discuss q-expectation values in Chapter 3. Regarding the definition ofnonextensive information content, we use Tsallis (1999) definition (2.51) in this thesis.

38

Page 48: Thesis Entropy Very Impotant

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9RenyiTsallis

SqR

(p)

& S

qT (

p)

q < 1

p

(a) Entropic Index q = 0.8

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

p

SqR

(p)

& S

qT (

p)

q > 1RenyiTsallis

(b) Entropic Index q = 1.2

Figure 2.3: Comparison of Renyi and Tsallis Entropy Functions

where STq and SR

q denote the Tsallis and Renyi entropies respectively with a real num-

ber q as a parameter. (2.54) implies that Tsallis and Renyi entropies are monotonic

functions of each other and, as a result, both must be maximized by the same probabil-

ity distribution. In this thesis, we consider only ME-prescriptions related to nonexten-

sive entropies. Discussion on ME of Renyi entropy can be found in (Bashkirov, 2004;

Johnson & Vignat, 2005; Costa, Hero, & Vignat, 2002).

Comparisons of Renyi entropy function (2.27) with Tsallis entropy function (2.35)

are shown graphically in Figure 2.3 for two cases of entropic index, corresponding to

0 < q < 1 and q > 1 respectively. Now a natural question that arises is whether

one could generalize Tsallis entropy using Renyi’s recipe, i.e. by replacing the linear

average in (2.52) by KN-averages and imposing the condition of pseudo-additivity. It

is equivalent to determining the KN-function ψ for which the so called q-quasilinear

39

Page 49: Thesis Entropy Very Impotant

entropy defined as

Sψ(X) =⟨H⟩ψ

= ψ−1

[n∑

k=1

pkψ(Hk

)], (2.55)

where Hk = H(xk), ∀k = 1, . . . n, satisfies the pseudo-additivity property.

First, we present the following result which characterizes the pseudo-additivity of

quasilinear means.

THEOREM 2.4 Let X,Y ∈ X be two independent random variables. Let ψ be any KN-function.

Then

〈X ⊕q Y 〉ψ = 〈X〉ψ ⊕q 〈Y 〉ψ (2.56)

if and only if ψ is linear.

Proof Let p and r be the p.m.fs of random variables X,Y ∈ X respectively. The proof of

sufficiency is simple and follows from

〈X ⊕q Y 〉ψ = 〈X ⊕q Y 〉 =

n∑

i=1

n∑

j=1

pirj(xi ⊕q yj)

=n∑

i=1

n∑

j=1

pirj(xi + yj + (1 − q)xiyj)

=

n∑

i=1

pixi +

n∑

j=1

rjyj + (1 − q)

n∑

i=1

pixi

n∑

j=1

rjyj .

To prove the converse, we need to determine all forms of ψ which satisfy

ψ−1

n∑

i=1

n∑

j=1

pirjψ (xi ⊕q yj)

= ψ−1

(n∑

i=1

piψ (xi)

)⊕q ψ

−1

n∑

j=1

rjψ (yj)

. (2.57)

Since (2.57) must hold for arbitrary p.m.fs p, r and for arbitrary numbers x1, . . . , xn

and y1, . . . , yn, one can choose yj = c for all j. Then (2.57) yields

ψ−1

(n∑

i=1

pkψ (xi ⊕q c)

)= ψ−1

(n∑

i=1

pkψ (xi)

)⊕q c . (2.58)

40

Page 50: Thesis Entropy Very Impotant

That is, ψ should satisfy

〈X ⊕q c〉ψ = 〈X〉ψ ⊕q c , (2.59)

for any X ∈ X and any constant c. This can be rearranged as

〈(1 + (1 − q)c)X + c〉ψ = (1 + (1 − q)c)〈X〉ψ + c

by using the definition of ⊕q. Since q is independent of other quantities, ψ should

satisfy an equation of the form

〈dX + c〉ψ = d〈X〉ψ + c , (2.60)

where d 6= 0 (by writing d = (1 + (1 − q)c)). Finally ψ must satisfy

〈X + c〉ψ = 〈X〉ψ + c (2.61)

and

〈dX〉ψ = d〈X〉ψ , (2.62)

for any X ∈ X and any constants d, c. From Theorem 2.3, condition (2.61) is satisfied

only when ψ is linear or exponential.

To complete the theorem, we have to show that KN-averages do not satisfy condi-

tion (2.62) when ψ is exponential. For a particular choice of ψ(x) = e(1−α)x, assume

that

〈dX〉ψ = d〈X〉ψ , (2.63)

where

〈dX〉ψ1=

1

1 − αln

(n∑

k=1

pke(1−α)dxk

),

and

d〈X〉ψ1=

d

1 − αln

(n∑

k=1

pke(1−α)xk

).

Now define a KN-function ψ′ as ψ′(x) = e(1−α)dx, for which

〈X〉ψ′ =1

d(1 − α)ln

(n∑

k=1

pke(1−α)dxk

).

Condition (2.63) implies

〈X〉ψ = 〈X〉ψ′ ,

and by Theorem 2.2, ψ and ψ′ are KN-equivalent, which gives a contradiction.

41

Page 51: Thesis Entropy Very Impotant

One can observe that the above proof avoids solving functional equations as in the

case of the proof of Theorem 2.3 (e.g., Aczel & Daroczy, 1975). Instead, it makes

use of Theorem 2.3 itself and other basic properties of KN-averages. The following

corollary is an immediate consequence of Theorem 2.4.

COROLLARY 2.1 The q-quasilinear entropy Sψ (defined as in (2.55)) with respect to a KN-function

ψ satisfies pseudo-additivity if and only if Sψ is Tsallis entropy.

Proof LetX,Y ∈ X be two independent random variables and let p, r be their corresponding

pmfs. By the pseudo-additivity constraint, ψ should satisfy

Sψ(X × Y ) = Sψ(X) ⊕q Sψ(Y ) . (2.64)

From the property of q-logarithm that lnq xy = lnq x⊕q lnq y, we need

ψ−1

n∑

i=1

n∑

j=1

pirjψ

(lnq

1

pirj

)

= ψ−1

(n∑

i=1

piψ

(lnq

1

pi

))⊕q ψ

−1

n∑

j=1

rjψ

(lnq

1

rj

) . (2.65)

Equivalently, we need

ψ−1

n∑

i=1

n∑

j=1

pirjψ(Hpi ⊕q H

rj

)

= ψ−1

(n∑

i=1

piψ(Hpi

))⊕q ψ

−1

n∑

j=1

rjψ(Hrj

) ,

where Hp and Hr represent the q-Hartley functions corresponding to probability dis-

tributions p and r respectively. That is, ψ should satisfy

〈Hp ⊕q Hr〉ψ = 〈Hp〉ψ ⊕q 〈Hr〉ψ .

Also from Theorem 2.4, ψ is linear and hence Sψ is Tsallis.

Corollary 2.1 shows that using Renyi’s recipe in the nonextensive case one can

prepare only Tsallis entropy, while in the classical case there are two possibilities.

Figure 2.4 summarizes the Renyi’s recipe for Shannon and Tsallis information mea-

sures.

42

Page 52: Thesis Entropy Very Impotant

Hartley Informationq−

q−

Hartley Information

Renyi Entropy Tsallis EntropyShannon Entropy ’

additivity

KN−average

pseudo−additivity

KN−average

Quasilinear EntropyQuasilinear Entropy

Figure 2.4: Renyi’s Recipe for Additive and Pseudo-additive Information Measures

2.5 A Characterization Theorem for Nonextensive Entropies

The significance of Renyi’s formalism to generalize Shannon entropy is a characteri-

zation of the set of all additive information measures in terms of axioms of quasilinear

means (Renyi, 1960). By the result, Theorem 2.4, that we presented in this chap-

ter, one can extend this characterization to pseudo-additive (nonextensive) information

measures. We emphasize here that, for such a characterization one would assume

that entropy is the expectation of a function of underlying r.v. In the classical case, the

function is Hartley function, while in the nonextensive case it is the q-Hartley function.

Since characterization of quasilinear means is given in terms of cumulative distri-

bution of a random variable as in (Hardy et al., 1934), we use the following definitions

and notation.

Let F : R → R denote the cumulative distribution function of the random variable

X ∈ X. Corresponding to a KN-function ψ : R → R, the generalized mean of F

(equivalently, generalized mean of X) can be written as

Eψ(F ) = Eψ(X) = 〈X〉ψ = ψ−1

(∫ ψ dF

), (2.66)

which is the continuous analogue to (2.16), and is axiomized by Kolmogorov, Nagumo,

de Finetti, c.f (Hardy et al., 1934, Theorem 215) as follows.

43

Page 53: Thesis Entropy Very Impotant

THEOREM 2.5 Let FI be the set of all cumulative distribution functions defined on some interval

I of the real line R. A functional κ : FI → R satisfies the following axioms:

[KN1] κ(δx) = x, where δx ∈ FI denotes the step function at x (Consistency with

certainty) ,

[KN2] F,G ∈ FI , if F ≤ G then κ(F ) ≤ κ(G); the equality holds if and only if

F = G (Monotonicity) and,

[KN3] F,G ∈ FI , if κ(F ) = κ(G) then κ(βF + (1− β)H) = κ(βG+ (1− β)H),

for any H ∈ FI (Quasilinearity)

if and only if there is a continuous strictly monotone function ψ such that

κ(F ) = ψ−1

(∫ ψ dF

).

Proof of the above characterization can be found in (Hardy et al., loc. cit.). Mod-

ified axioms for the quasilinear mean can be found in (Chew, 1983; Fishburn, 1986;

Ostasiewicz & Ostasiewicz, 2000). Using this characterization of the quasilinear mean,

Renyi gave the following characterization for additive information measures.

THEOREM 2.6 Let X ∈ X be a random variable. An information measure defined as a (gener-

alized) mean κ of Hartley function of X is either Shannon or Renyi if and only

if

1. κ satisfies axioms of quasilinear means [KN1]-[KN3] given in Theorem 2.5

and,

2. If X1, X2 ∈ X are two random variables which are independent, then

κ(X1 +X2) = κ(X1) + κ(X2) .

Further, if κ satisfies κ(Y ) + κ(−Y ) = 0 for any Y ∈ X then κ is necessarily

Shannon entropy.

The proof of above theorem is straight forward by using Theorem (2.3); for details

see (Renyi, 1960).

Now we give the following characterization theorem for nonextensive entropies.

44

Page 54: Thesis Entropy Very Impotant

THEOREM 2.7 Let X ∈ X be a random variable. An information measure defined as a (general-

ized) mean κ of q-Hartley function of X is Tsallis entropy if and only if

1. κ satisfies axioms of quasilinear means [KN1]-[KN3] given in Theorem 2.5

and,

2. If X1, X2 ∈ X are two random variables which are independent, then

κ(X1 ⊕q X2) = κ(X1) ⊕q κ(X2) .

The above theorem is a direct consequence of Theorems 2.4 and 2.5. This charac-

terization of Tsallis entropy only replaces the additivity constraint in the characteriza-

tion of Shannon entropy given by Renyi (1960) with pseudo-additivity, which further

does not make use of the postulate κ(X) + κ(−X) = 0. (This postulate is needed

to distinguish Shannon entropy from Renyi entropy). This is possible because Tsallis

entropy is unique by means of KN-averages and under pseudo-additivity.

From the relation between Renyi and Tsallis information measures (2.54), pos-

sibly, generalized averages play a role – though not very well understood till now –

in describing the operational significance of Tsallis entropy. Here, one should men-

tion the work of Czachor and Naudts (2002), who studied the KN-average based ME-

prescriptions of generalized information measures (constraints with respect to which

one would maximize entropy are defined in terms of quasilinear means). In this regard,

results presented in this chapter have mathematical significance in the sense that they

further the relation between nonextensive entropic measures and generalized averages.

45

Page 55: Thesis Entropy Very Impotant

3 Measures and Entropies:Gelfand-Yaglom-Perez Theorem

Abstract

The measure-theoretic KL-entropy defined as∫

Xln dP

dRdP , where P and R are

probability measures on a measurable space (X,M), plays a basic role in the defi-nitions of classical information measures. A fundamental theorem in this respect isthe Gelfand-Yaglom-Perez Theorem (Pinsker, 1960b, Theorem 2.4.2) which equipsmeasure-theoretic KL-entropy with a fundamental definition and can be stated as,

X

lndP

dRdP = sup

m∑

k=1

P (Ek) lnP (Ek)

R(Ek),

where supremum is taken over all the measurable partitions Ekm

k=1. In this chap-

ter, we state and prove the GYP-theorem for Renyi relative-entropy of order greaterthan one. Consequently, the result can be easily extended to Tsallis relative-entropy.Prior to this, we develop measure-theoretic definitions of generalized informationmeasures and discuss the maximum entropy prescriptions. Some of the results pre-sented in this chapter can also be found in (Dukkipati, Bhatnagar, & Murty, 2006b,2006a).

Shannon’s measure of information was developed essentially for the case when the

random variable takes a finite number of values. However in the literature, one often

encounters an extension of Shannon entropy in the discrete case (2.1) to the case of a

one-dimensional random variable with density function p in the form (e.g., Shannon

& Weaver, 1949; Ash, 1965)

S(p) = −∫ +∞

−∞p(x) ln p(x) dx .

This entropy in the continuous case as a pure-mathematical formula (assuming conver-

gence of the integral and absolute continuity of the density p with respect to Lebesgue

measure) resembles Shannon entropy in the discrete case, but cannot be used as a

measure of information for the following reasons. First, it is not a natural extension of

Shannon entropy in the discrete case, since it is not the limit of the sequence of finite

discrete entropies corresponding to pmfs which approximate the pdf p. Second, it is

not strictly positive.

46

Page 56: Thesis Entropy Very Impotant

Inspite of these short comings, one can still use the continuous entropy functional

in conjunction with the principle of maximum entropy where one wants to find a proba-

bility density function that has greater uncertainty than any other distribution satisfying

a set of given constraints. Thus, one is interested in the use of continuous measure as

a measure of relative and not absolute uncertainty. This is where one can relate maxi-

mization of Shannon entropy to the minimization of Kullback-Leibler relative-entropy

cf. (Kapur & Kesavan, 1997, pp. 55). On the other hand, it is well known that the

continuous version of KL-entropy defined for two probability density functions p and

r,

I(p‖r) =

∫ +∞

−∞p(x) ln

p(x)

r(x)dx ,

is indeed a natural generalization of the same in the discrete case.

Indeed, during the early stages of development of information theory, the important

paper by Gelfand, Kolmogorov, and Yaglom (1956) called attention to the case of

defining entropy functional on an arbitrary measure space (X,M, µ). In this case,

Shannon entropy of a probability density function p : X → R+ can be written as,

S(p) = −∫

Xp(x) ln p(x) dµ(x) .

One can see from the above definition that the concept of “the entropy of a pdf” is a

misnomer as there is always another measure µ in the background. In the discrete case

considered by Shannon, µ is the cardinality measure1 (Shannon & Weaver, 1949, pp.

19); in the continuous case considered by both Shannon and Wiener, µ is the Lebesgue

measure cf. (Shannon & Weaver, 1949, pp. 54) and (Wiener, 1948, pp. 61, 62). All

entropies are defined with respect to some measure µ, as Shannon and Wiener both

emphasized in (Shannon & Weaver, 1949, pp.57, 58) and (Wiener, 1948, pp.61, 62)

respectively.

This case was studied independently by Kallianpur (1960) and Pinsker (1960b),

and perhaps others were guided by the earlier work of Kullback and Leibler (1951),

where one would define entropy in terms of Kullback-Leibler relative-entropy. In

this respect, the Gelfand-Yaglom-Perez theorem (GYP-theorem) (Gelfand & Yaglom,

1959; Perez, 1959; Dobrushin, 1959) plays an important role as it equips measure-

theoretic KL-entropy with a fundamental definition. The main contribution of this

chapter is to prove GYP-theorem for Renyi relative-entropy of order α > 1, which can

be extended to Tsallis relative-entropy.1Counting or cardinality measure µ on a measurable space (X, ), where X is a finite set and = 2X , is defined as µ(E) = #E, ∀E ∈ .

47

Page 57: Thesis Entropy Very Impotant

Before proving GYP-theorem for Renyi relative-entropy, we study the measure-

theoretic definitions of generalized information measures in detail, and discuss the

corresponding ME-prescriptions. We show that as in the case of relative-entropy,

the measure-theoretic definitions of generalized relative-entropies, Renyi and Tsallis,

are natural extensions of their respective discrete definition. We also show that ME-

prescriptions of measure-theoretic Tsallis entropy are consistent with that of discrete

case, which is true for measure-theoretic Shannon-entropy.

We review the measure-theoretic formalisms for classical information measures in

§ 3.1 and extend these definitions to generalized information measures in § 3.2. In

§ 3.3 we present the ME-prescription for Shannon entropy followed by prescriptions

for Tsallis entropy in § 3.4. We revisit measure-theoretic definitions of generalized

entropy functionals in § 3.5 and present some results. Finally, Gelfand-Yaglom-Perez

theorem in the general case is presented in § 3.6.

3.1 Measure Theoretic Definitions of Classical Information Measures

In this section, we study the non-discrete definitions of entropy and KL-entropy and

present the formal definitions on the measure spaces. Rigorous studies of the Shannon

and KL entropy functionals in measure spaces can be found in the papers by Ochs

(1976) and Masani (1992a, 1992b). Basic measure-theoretic aspects of classical in-

formation measures can be found in books on information theory by Pinsker (1960b),

Guiasu (1977) and Gray (1990). For more details on development of mathematical

information theory one can refer to excellent survey by Kotz (1966). This survey is

perhaps the best available English-language guide to the Eastern European information

theory literature for the period 1956-1966. One can also refer to (Cover et al., 1989)

for a review on Kolmogorov’s contributions to mathematical information theory.

A note on the notation. To avoid proliferation of symbols we use the same no-

tation for the information measures in the discrete and non-discrete cases; the corre-

spondence should be clear from the context. For example, we use S(p) to denote the

entropy of a pdf p in the measure-theoretic setting too. Whenever we have to com-

pare these quantities in different cases we use the symbols appropriately, which will

be specified in the sequel.

3.1.1 Discrete to Continuous

Let p : [a, b] → R+ be a probability density function, where [a, b] ⊂ R. That is, p

48

Page 58: Thesis Entropy Very Impotant

satisfies

p(x) ≥ 0, ∀x ∈ [a, b] and

∫ b

ap(x) dx = 1 .

In trying to define entropy in the continuous case, the expression of Shannon entropy

in the discrete case (2.1) was automatically extended to continuous case by replacing

the sum in the discrete case with the corresponding integral. We obtain, in this way,

Boltzmann’s H-function (also known as differential entropy in information theory),

S(p) = −∫ b

ap(x) ln p(x) dx . (3.1)

The “continuous entropy” given by (3.1) is not a natural extension of definition in

discrete case in the sense that, it is not the limit of the finite discrete entropies cor-

responding to a sequence of finer partitions of the interval [a, b] whose norms tend

to zero. We can show this by a counter example. I Consider a uniform probability

distribution on the interval [a, b], having the probability density function

p(x) =1

b− a, x ∈ [a, b] .

The continuous entropy (3.1), in this case will be

S(p) = ln(b− a) .

On the other hand, let us consider a finite partition of the interval [a, b] which is com-

posed of n equal subintervals, and let us attach to this partition the finite discrete

uniform probability distribution whose corresponding entropy will be, of course,

Sn(p) = lnn .

Obviously, if n tends to infinity, the discrete entropy Sn(p) will tend to infinity too,

and not to ln(b− a); therefore S(p) is not the limit of Sn(p), when n tends to infinity.

J Further, one can observe that ln(b− a) is negative when b− a < 1.

Thus, strictly speaking, continuous entropy (3.1) cannot represent a measure of

uncertainty since uncertainty should in general be positive. We are able to prove the

“nice” properties only for the discrete entropy, therefore, it qualifies as a “good” mea-

sure of information (or uncertainty) supplied by a random experiment2 . We cannot2One importent property that Shannon entropy exhibits in the continuous case is the entropy power

inequality, which can be stated as follows. Let X and Y are continuous independent random variableswith entropies S(X) and S(Y ) then we have e2S(X+Y ) ≥ e2S(X) + e2S(Y ) with equality if and only ifX and Y are Gaussian variables or one of them is determenistic. The entropy power inequality is derivedby Shannon (1948). Only few and partial versions of it have been proved in the discrete case.

49

Page 59: Thesis Entropy Very Impotant

extend the so called nice properties to the “continuous entropy” because it is not the

limit of a suitably defined sequence of discrete entropies.

Also, in physical applications, the coordinate x in (3.1) represents an abscissa, a

distance from a fixed reference point. This distance x has the dimensions of length.

Since the density function p(x) specifies the probabilities of an event of type [c, d) ⊂[a, b] as

∫ dc p(x) dx and probabilities are dimensionless, one has to assign the dimen-

sions (length)−1 to p(x). Now for 0 ≤ z < 1, one has the series expansion

− ln(1 − z) = z +1

2z2 +

1

3z3 + . . . . (3.2)

It is thus necessary that the argument of the logarithmic function in (3.1) be dimen-

sionless. Hence the formula (3.1) is then seen to be dimensionally incorrect, since the

argument of the logarithm on its right hand side has the dimensions of a probability

density (Smith, 2001). Although, Shannon (1948) used the formula (3.1), he did note

its lack of invariance with respect to changes in the coordinate system.

In the context of maximum entropy principle, Jaynes (1968) addressed this prob-

lem and suggested the formula,

S′(p) = −∫ b

ap(x) ln

p(x)

m(x)dx , (3.3)

in the place of (3.1), where m(x) is a prior function. Note that when m(x) is also a

probability density function, (3.3) is nothing but the relative-entropy. However, if we

choose m(x) = c, a constant (e.g., Zellner & Highfield, 1988), we get

S′(p) = S(p) + ln c ,

where S(p) refers to the continuous entropy (3.1). Thus, maximization of S ′(p) is

equivalent to maximization of S(p). Further discussion on estimation of probability

density functions by maximum entropy method can be found in (Lazo & Rathie, 1978;

Zellner & Highfield, 1988; Ryu, 1993).

Prior to that, Kullback and Leibler (1951) too suggested that in the measure-

theoretic definition of entropy, instead of examining the entropy corresponding only

to the given measure, we have to compare the entropy inside a whole class of mea-

sures.

3.1.2 Classical Information Measures

Let (X,M, µ) be a measure space, where µ need not be a probability measure unless

otherwise specified. Symbols P , R will denote probability measures on measurable

50

Page 60: Thesis Entropy Very Impotant

space (X,M) and p, r denote M-measurable functions onX . An M-measurable func-

tion p : X → R+ is said to be a probability density function (pdf) if

∫X p(x) dµ(x) =

1 or∫X pdµ = 1 (henceforth, the argument x will be omitted in the integrals if this

does not cause ambiguity).

In this general setting, Shannon entropy S(p) of pdf p is defined as follows (Athreya,

1994).

DEFINITION 3.1 Let (X,M, µ) be a measure space and the M-measurable function p : X → R+ be

a pdf. Then, Shannon entropy of p is defined as

S(p) = −∫

Xp ln pdµ , (3.4)

provided the integral on right exists.

Entropy functional S(p) defined in (3.4) can be referred to as entropy of the prob-

ability measure P that is induced by p, that is defined according to

P (E) =

Ep(x) dµ(x) , ∀E ∈ M . (3.5)

This reference is consistent3 because the probability measure P can be identified a.e

by the pdf p.

Further, the definition of the probability measure P in (3.5), allows us to write

entropy functional (3.4) as,

S(p) = −∫

X

dP

dµln

dP

dµdµ , (3.6)

since (3.5) implies4 P µ, and pdf p is the Radon-Nikodym derivative of P w.r.t µ.

Now we proceed to the definition of Kullback-Leibler relative-entropy or KL-

entropy for probability measures.3Say p and r be two pdfs and P and R be the corresponding induced measures on measurable space

(X, ) such that P and R are identical, i.e., Epdµ =

Er dµ, ∀E ∈ . Then we have p = r, µ a.e,

and hence − Xp ln pdµ = −

Xr ln r dµ.

4If a nonnegative measurable function f induces a measure ν on measurable space (X, ) withrespect to a measure µ, defined as ν(E) =

Ef dµ, ∀E ∈ then ν µ. The converse of this result

is given by Radon-Nikodym theorem (Kantorovitz, 2003, pp.36, Theorem 1.40(b)).

51

Page 61: Thesis Entropy Very Impotant

DEFINITION 3.2 Let (X,M) be a measurable space. Let P and R be two probability measures on

(X,M). Kullback-Leibler relative-entropy KL-entropy of P relative toR is defined

as

I(P‖R) =

Xln

dP

dRdP if P R ,

+∞ otherwise.

(3.7)

The divergence inequality I(P‖R) ≥ 0 and I(P‖R) = 0 if and only if P = R

can be shown in this case too. KL-entropy (3.7) also can be written as

I(P‖R) =

X

dP

dRln

dP

dRdR . (3.8)

Let the σ-finite measure µ on (X,M) be such that P R µ. Since µ is

σ-finite, from Radon-Nikodym theorem, there exist non-negative M-measurable func-

tions p : X → R+ and r : X → R

+ unique µ-a.e, such that

P (E) =

Epdµ , ∀E ∈ M , (3.9a)

and

R(E) =

Er dµ , ∀E ∈ M . (3.9b)

The pdfs p and r in (3.9a) and (3.9b) (they are indeed pdfs) are Radon-Nikodym deriva-

tives of probability measures P and R with respect to µ, respectively, i.e., p = dPdµ and

r = dRdµ . Now one can define relative-entropy of pdf p w.r.t r as follows5.

DEFINITION 3.3 Let (X,M, µ) be a measure space. Let M-measurable functions p, r : X → R+

be two pdfs. The KL-entropy of p relative to r is defined as

I(p‖r) =

Xp(x) ln

p(x)

r(x)dµ(x) , (3.10)

provided the integral on right exists.

As we have mentioned earlier, KL-entropy (3.10) exists if the two densities are

absolutely continuous with respect to one another. On the real line, the same definition

can be written with respect to the Lebesgue measure

I(p‖r) =

∫ p(x) lnp(x)

r(x)dx ,

5This follows from the chain rule for Radon-Nikodym derivative:

dP

dR

a.e=

dP

dR

dµ −1

.

52

Page 62: Thesis Entropy Very Impotant

which exists if the densities p(x) and r(x) share the same support. Here, in the sequel

we use the convention

ln 0 = −∞, lna

0= +∞ forany a ∈ R, 0.(±∞) = 0. (3.11)

Now, we turn to the definition of entropy functional on a measure space. Entropy

functional in (3.6) is defined for a probability measure that is induced by a pdf. By

the Radon-Nikodym theorem, one can define Shannon entropy for any arbitrary µ-

continuous probability measure as follows.

DEFINITION 3.4 Let (X,M, µ) be a σ-finite measure space. Entropy of any µ-continuous probability

measure P (P µ) is defined as

S(P ) = −∫

Xln

dP

dµdP . (3.12)

The entropy functional (3.12) is known as Baron-Jauch entropy or generalized

Boltzmann-Gibbs-Shannon entropy (Wehrl, 1991). Properties of entropy of a proba-

bility measure in the Definition 3.4 are studied in detail by Ochs (1976). In the lit-

erature, one can find notation of the form S(P |µ) to represent the entropy functional

in (3.12) viz., the entropy of a probability measure, to stress the role of the measure

µ (e.g., Ochs, 1976; Athreya, 1994). Since all the information measures we define are

with respect to the measure µ on (X,M), we omit µ in the entropy functional notation.

By assuming µ as a probability measure in the Definition 3.4, one can relate Shan-

non entropy with Kullback-Leibler entropy as,

S(P ) = −I(P‖µ) . (3.13)

Note that when µ is not a probability measure, the divergence inequality I(P‖µ) ≥ 0

need not be satisfied.

A note on the σ-finiteness of measure µ in the definition of entropy functional. In

the definition of entropy functional we assumed that µ is a σ-finite measure. This con-

dition was used by Ochs (1976), Csiszar (1969) and Rosenblatt-Roth (1964) to tailor

the measure-theoretic definitions. For all practical purposes and for most applications,

this assumption is satisfied (see (Ochs, 1976) for a discussion on the physical inter-

pretation of measurable space (X,M) with σ-finite measure µ for an entropy measure

of the form (3.12), and of the relaxation of the σ-finiteness condition). The more uni-

versal definitions of entropy functionals, by relaxing the σ-finiteness condition, are

studied by Masani (1992a, 1992b).

53

Page 63: Thesis Entropy Very Impotant

3.1.3 Interpretation of Discrete and Continuous Entropies in terms of KL-entropy

First, let us consider the discrete case of (X,M, µ), where X = x1, . . . , xn, M =

2X is the power set of X . Let P and µ be any probability measures on (X,M). Then

µ and P can be specified as follows.

µ: µk = µ(xk) ≥ 0, k = 1, . . . , n,

n∑

k=1

µk = 1 , (3.14a)

and

P : Pk = P (xk) ≥ 0, k = 1, . . . , n,n∑

k=1

Pk = 1 . (3.14b)

The probability measure P is absolutely continuous with respect to the probability

measure µ if µk = 0 for some k ∈ 1, . . . , n then Pk = 0 as well. The corresponding

Radon-Nikodym derivative of P with respect to µ is given by

dP

dµ(xk) =

Pk

µk, k = 1, . . . n .

The measure-theoretic entropy S(P ) (3.12), in this case, can be written as

S(P ) = −n∑

k=1

Pk lnPk

µk=

n∑

k=1

Pk lnµk −n∑

k=1

Pk lnPk .

If we take referential probability measure µ as a uniform probability distribution on

the set X , i.e. µk = 1n , k = 1, . . . , n, we obtain

S(P ) = Sn(P ) − lnn , (3.15)

where Sn(P ) denotes the Shannon entropy (2.1) of pmf P = (P1, . . . , Pn) and S(P )

denotes the measure-theoretic entropy (3.12) reduced to the discrete case, with the

probability measures µ and P specified as in (3.14a) and (3.14b) respectively.

Now, let us consider the continuous case of (X,M, µ), where X = [a, b] ⊂ R, M

is the σ-algebra of Lebesgue measurable subsets of [a, b]. In this case µ and P can be

specified as follows.

µ: µ(x) ≥ 0, x ∈ [a, b],3 µ(E) =

Eµ(x) dx,∀E ∈ M,

∫ b

aµ(x) dx = 1 ,

(3.16a)

and

P : P (x) ≥ 0, x ∈ [a, b],3 P (E) =

EP (x) dx,∀E ∈ M,

∫ b

aP (x) dx = 1 .

(3.16b)

54

Page 64: Thesis Entropy Very Impotant

Note that the abuse of notation in the above specification of probability measures µ

and P , where we have used the same symbols for both measures and pdfs, is in order

to have the notation consistent with the discrete case analysis given above. The proba-

bility measure P is absolutely continuous with respect to the probability measure µ, if

µ(x) = 0 on a set of a positive Lebesgue measure implies that P (x) = 0 on the same

set. The Radon-Nikodym derivative of the probability measure P with respect to the

probability measure µ will be

dP

dµ(x) =

P (x)

µ(x).

We emphasize here that this relation can only be understood with the above (abuse

of) notation explained. Then the measure-theoretic entropy S(P ) in this case can be

written as

S(P ) = −∫ b

aP (x) ln

P (x)

µ(x)dx .

If we take referential probability measure µ as a uniform distribution, i.e. µ(x) = 1b−a ,

x ∈ [a, b], then we obtain

S(P ) = S[a,b](P ) − ln(b− a) , (3.17)

where S[a,b](P ) denotes the Shannon entropy (3.1) of pdf P (x) and S(P ) denotes the

measure-theoretic entropy (3.12) reduced to the continuous case, with the probability

measures µ and P specified as in (3.16a) and (3.16b) respectively.

Hence, one can conclude that measure theoretic entropy S(P ) defined for a proba-

bility measure P on the measure space (X,M, µ), is equal to both Shannon entropy in

the discrete and continuous case up to an additive constant, when the reference mea-

sure µ is chosen as a uniform probability distribution. On the other hand, one can see

that measure-theoretic KL-entropy, in the discrete and continuous cases corresponds

to its discrete and continuous definitions.

Further, from (3.13) and (3.15), we can write Shannon entropy in terms of Kullback-

Leibler relative-entropy as

Sn(P ) = lnn− I(P‖µ) . (3.18)

Thus, Shannon entropy appears as being (up to an additive constant) the variation

of information when we pass from the initial uniform probability distribution to new

probability distribution given by Pk ≥ 0,∑n

k=1 Pk = 1, as any such probability

distribution is obviously absolutely continuous with respect to the uniform discrete

55

Page 65: Thesis Entropy Very Impotant

probability distribution. Similarly, from (3.13) and (3.17) the relation between Shan-

non entropy and relative-entropy in continuous case can be obtained, and we can write

Boltzmann H-function in terms of relative-entropy as

S[a,b](p) = ln(b− a) − I(P‖µ) . (3.19)

Therefore, the continuous entropy or Boltzmann H-function S(p) may be interpreted

as being (up to an additive constant) the variation of information when we pass from

the initial uniform probability distribution on the interval [a, b] to the new probability

measure defined by the probability distribution function p(x) (any such probability

measure is absolutely continuous with respect to the uniform probability distribution

on the interval [a, b]).

From the above discussion one can see that KL-entropy equips one with unitary

interpretation of both discrete entropy and continuous entropy. One can utilize Shan-

non entropy in the continuous case, as well as Shannon entropy in the discrete case,

both being interpreted as the variation of information when we pass from the initial

uniform distribution to the corresponding probability measure.

Also, since measure theoretic entropy is equal to the discrete and continuous en-

tropy up to an additive constant, ME-prescriptions of measure-theoretic Shannon en-

tropy are consistent with both the discrete and continuous cases.

3.2 Measure-Theoretic Definitions of Generalized Information Measures

In this section we extend the measure-theoretic definitions to generalized information

measures discussed in Chapter 2. We begin with a brief note on the notation and

assumptions used.

We define all the information measures on the measurable space (X,M). The

default reference measure is µ unless otherwise stated. For simplicity in exposition,

we will not distinguish between functions differing on a µ-null set only; nevertheless,

we can work with equations between M-measurable functions on X if they are stated

as being valid only µ-almost everywhere (µ-a.e or a.e). Further we assume that all

the quantities of interest exist and also assume, implicitly, the σ-finiteness of µ and

µ-continuity of probability measures whenever required. Since these assumptions re-

peatedly occur in various definitions and formulations, these will not be mentioned

in the sequel. With these assumptions we do not distinguish between an information

measure of pdf p and that of the corresponding probability measure P – hence when

56

Page 66: Thesis Entropy Very Impotant

we give definitions of information measures for pdfs, we also use the corresponding

definitions of probability measures as well, wherever convenient or required – with

the understanding that P (E) =∫E pdµ, and the converse holding as a result of the

Radon-Nikodym theorem, with p = dPdµ . In both the cases we have P µ.

With these notations we move on to the measure-theoretic definitions of general-

ized information measures. First we consider the Renyi generalizations. The measure-

theoretic definition of Renyi entropy is as follows.

DEFINITION 3.5 Renyi entropy of a pdf p : X → R+ on a measure space (X,M, µ) is defined as

Sα(p) =1

1 − αln

Xp(x)α dµ(x) , (3.20)

provided the integral on the right exists and α ∈ R, α > 0.

The same can also be defined for any µ-continuous probability measure P as

Sα(P ) =1

1 − αln

X

(dP

)α−1

dP . (3.21)

On the other hand, Renyi relative-entropy can be defined as follows.

DEFINITION 3.6 Let p, r : X → R+ be two pdfs on a measure space (X,M, µ). The Renyi relative-

entropy of p relative to r is defined as

Iα(p‖r) =1

α− 1ln

X

p(x)α

r(x)α−1dµ(x) , (3.22)

provided the integral on the right exists and α ∈ R, α > 0.

The same can be written in terms of probability measures as,

Iα(P‖R)=1

α− 1ln

X

(dP

dR

)α−1

dP

=1

α− 1ln

X

(dP

dR

)αdR , (3.23)

whenever P R; Iα(P‖R) = +∞, otherwise. Further if we assume µ in (3.21) is a

probability measure then

Sα(P ) = Iα(P‖µ) . (3.24)

The Tsallis entropy in the measure-theoretic setting can be defined as follows.

57

Page 67: Thesis Entropy Very Impotant

DEFINITION 3.7 Tsallis entropy of a pdf p on (X,M, µ) is defined as

Sq(p) =

Xp(x) lnq

1

p(x)dµ(x) =

1 −∫X p(x)

q dµ(x)

q − 1, (3.25)

provided the integral on the right exists and q ∈ R, q > 0.

The q-logarithm lnq is defined as in (2.41). The same can be defined for µ-

continuous probability measure P , and can be written as

Sq(P ) =

Xlnq

(dP

)−1

dP . (3.26)

The definition of Tsallis relative-entropy is given below.

DEFINITION 3.8 Let (X,M, µ) be a measure space. Let p, r : X → R+ be two probability density

functions. The Tsallis relative-entropy of p relative to r is defined as

Iq(p‖r) = −∫

Xp(x) lnq

r(x)

p(x)dµ(x) =

∫X

p(x)q

r(x)q−1 dµ(x) − 1

q − 1(3.27)

provided the integral on the right exists and q ∈ R, q > 0.

The same can be written for two probability measures P and R, as

Iq(P‖R) = −∫

Xlnq

(dP

dR

)−1

dP , (3.28)

whenever P R; Iq(P‖R) = +∞, otherwise. If µ in (3.26) is a probability measure

then

Sq(P ) = Iq(P‖µ) . (3.29)

We shall revisit these measure-theoretic definitions in § 3.5.

3.3 Maximum Entropy and Canonical Distributions

For all the ME-prescriptions of classical information measures we consider the set of

constraints of the form∫

Xum dP =

Xum(x)p(x) dµ(x) = 〈um〉 , m = 1, . . . ,M , (3.30)

with respect to M-measurable functions um : X → R, m = 1, . . . ,M , whose

expectation values 〈um〉, m = 1, . . . ,M are (assumed to be) a priori known, along

58

Page 68: Thesis Entropy Very Impotant

with the normalizing constraint∫X dP = 1. (From now on we assume that any set of

constraints on probability distributions implicitly includes this constraint, which will

therefore not be mentioned in the sequel.)

To maximize the entropy (3.4) with respect to the constraints (3.30), the solution

is calculated via the Lagrangian:

L(x, λ, β) = −∫

Xln

dP

dµ(x) dP (x) − λ

(∫

XdP (x) − 1

)

−M∑

m=1

βm

(∫

Xum(x) dP (x) − 〈um〉

), (3.31)

where λ and βm, m = 1, . . . ,M are Lagrange parameters (we use the notation β =

(β1, . . . , βM )). The solution is given by

lndP

dµ(x) + λ+

M∑

m=1

βmum(x) = 0 .

The solution can be calculated as

dP (x) = exp

(− lnZ(β) −

M∑

m=1

βmum(x)

)dµ(x) (3.32)

or

p(x) =dP

dµ(x) =

e− Mm=1 βmum(x)

Z(β), (3.33)

where the partition function Z(β) is written as

Z(β) =

Xexp

(−

M∑

m=1

βmum(x)

)dµ(x) . (3.34)

The Lagrange parameters βm, m = 1, . . . M are specified by the set of constraints

(3.30).

The maximum entropy, denoted by S, can be calculated as

S = lnZ +

M∑

m=1

βm〈um〉 . (3.35)

The Lagrange parameters βm, m = 1, . . . M , are calculated by searching the

unique solution (if it exists) of the following system of nonlinear equations:

∂βmlnZ(β) = −〈um〉 , m = 1, . . . M. (3.36)

We also have

∂S

∂〈um〉= βm , m = 1, . . . M. (3.37)

Equations (3.36) and (3.37) are referred to as the thermodynamic equations.

59

Page 69: Thesis Entropy Very Impotant

3.4 ME-prescription for Tsallis Entropy

As we mentioned earlier, the great success of Tsallis entropy is attributed to the power-

law distributions that result from the ME-prescriptions of Tsallis entropy. But there are

subtleties involved in the choice of constraints one would choose for ME prescriptions

of these entropy functionals. The issue of what kind of constraints one should use in

the ME-prescriptions is still a part of the major discussion in the nonextensive formal-

ism (Ferri et al., 2005; Abe & Bagci, 2005; Wada & Scarfone, 2005).

In the nonextensive formalism, maximum entropy distributions are derived with

respect to the constraints that are different from (3.30), and are inadequate for han-

dling the serious mathematical difficulties that result for instance, those of unwanted

divergences etc. cf. (Tsallis, 1988). To handle these difficulties constraints of the form∫

Xum(x)p(x)q dµ(x) = 〈um〉q ,m = 1, . . . ,M (3.38)

are proposed by Curado and Tsallis (1991). The averages of the form 〈um〉q are re-

ferred to as q-expectations.

3.4.1 Tsallis Maximum Entropy Distribution

To calculate the maximum Tsallis entropy distribution with respect to the constraints

(3.38), the Lagrangian can be written as

L(x, λ, β) =

Xlnq

1

p(x)dP (x) − λ

(∫

XdP (x) − 1

)

−M∑

m=1

βm

(∫

Xp(x)q−1um(x) dP (x) − 〈um〉q

). (3.39)

The solution is given by

lnq1

p(x)− λ−

M∑

m=1

βmum(x)p(x)q−1 = 0 . (3.40)

By the definition of q-logarithm (2.41), (3.40) can be rearranged as

p(x) =

[1 − (1 − q)

∑Mm=1 βmum(x)

] 11−q

(λ(1 − q) + 1)1

1−q

. (3.41)

60

Page 70: Thesis Entropy Very Impotant

The denominator in (3.41) can be calculated using the normalizing constraint∫X dP =

1. Finally, Tsallis maximum entropy distribution can be written as

p(x) =

[1 − (1 − q)

∑Mm=1 βmum(x)

] 11−q

Zq, (3.42)

where the partition function is

Zq =

X

[1 − (1 − q)

M∑

m=1

βmum(x)

] 11−q

dµ(x) . (3.43)

Tsallis maximum entropy distribution (3.42) can be expressed in terms of the q-expectation

function (2.42) as

p(x) =e−

Mm=1 βmum(x)

q

Zq. (3.44)

Note that in order to guarantee that pdf p in (3.42) is non-negative real for any

x ∈ X , it is necessary to supplement it with an appropriate prescription for treating

negative values of the quantity[1 − (1 − q)

∑Mm=1 βmum(x)

]. That is, we need a

prescription for the value of p(x) when

[1 − (1 − q)

M∑

m=1

βmum(x)

]< 0 . (3.45)

The simplest possible prescription, and the one usually adopted, is to set p(x) = 0

whenever the inequality (3.45) holds (Tsallis, 1988; Curado & Tsallis, 1991). This

rule is known as the Tsallis cut-off condition. Simple extensions of Tsallis cut-off

conditions are proposed in (Teweldeberhan et al., 2005) by defining an alternate q-

exponential function. In this thesis, we consider only the usual Tsallis cut-off con-

dition mentioned above. Note that by expressing Tsallis maximum entropy distribu-

tion (3.42) in terms of the q-exponential function, as in (3.44), we have assumed Tsallis

cut-off condition implicitly. In summary, when we refer to Tsallis maximum entropy

distribution we mean the following

p(x) =

e−

Mm=1 βmum(x)

q

Zqif[1 − (1 − q)

∑Mm=1 βmum(x)

]> 0

0 otherwise.

(3.46)

Maximum Tsallis entropy can be calculated as (Curado & Tsallis, 1991),

Sq = lnZq +

M∑

m=1

βm〈um〉q . (3.47)

61

Page 71: Thesis Entropy Very Impotant

The corresponding thermodynamic equations are as follows (Curado & Tsallis, 1991):

∂βmlnq Zq = −〈um〉q , m = 1, . . . M, (3.48)

∂Sq

∂〈um〉q= βm , m = 1, . . . M. (3.49)

It may be interesting to compare these equations with their classical counterparts,

(3.36) and (3.37), to see the consistency in generalizations.

Here we mention that some important mathematical properties of nonextensive

maximum entropy distribution (3.42) for q = 12 has been studied and reported by Rebollo-

Neira (2001) with applications to data subset selection. One can refer to (Vignat, Hero,

& Costa, 2004) for a study of Tsallis maximum entropy distributions in the multivari-

ate case.

3.4.2 The Case of Normalized q-expectation values

Constraints of the form (3.38) had been used for some time in the nonextensive ME-

prescriptions, but because of problems in justifying it on physical grounds (for example

q-expectation of a constant need not be a constant and hence they are not expecta-

tions in the true sense) the constraints of the following form were proposed in (Tsallis,

Mendes, & Plastino, 1998)∫X um(x)p(x)q dµ(x)∫

X p(x)q dµ(x)

= 〈〈um〉〉q ,m = 1, . . . ,M . (3.50)

Here 〈〈um〉〉q can be considered as the expectation of um with respect to the modified

probability measure P(q) (it is indeed a probability measure) defined as

P(q)(E) =

(∫

Xp(x)q dµ(x)

)−1 ∫

Ep(x)q dµ(x) , ∀E ∈ M . (3.51)

The modified probability measure P(q) is known as the escort probability measure (Tsal-

lis et al., 1998).

Now, the variational principle for Tsallis entropy maximization with respect to

constraints (3.50) can be written as

L(x, λ, β) =

Xlnq

1

p(x)dP (x) − λ

(∫

XdP (x) − 1

)

−M∑

m=1

β(q)m

(∫

Xp(x)q−1

(um(x) − 〈〈um〉〉q

)dP (x)

), (3.52)

62

Page 72: Thesis Entropy Very Impotant

where the parameters β(q)m can be defined in terms of true Lagrange parameters βm as

β(q)m =

βm∫

Xp(x)q dµ(x)

, m = 1, . . . ,M. (3.53)

The maximum entropy distribution in this case turns out to be

p(x) =

1 − (1 − q)

∑Mm=1 βm

(um(x) − 〈〈um〉〉q

)

∫X p(x)

qdµ(x)

11−q

Zq.

(3.54)

This can be written using q-exponential functions as

p(x) =1

Zqexpq

∑Mm=1 βm

(um(x) − 〈〈um〉〉q

)

∫X p(x)

q dµ(x)

, (3.55)

where

Zq =

Xexpq

∑Mm=1 βm

(um(x) − 〈〈um〉〉q

)

∫X p(x)

q dµ(x)

dµ(x) . (3.56)

Maximum Tsallis entropy Sq in this case satisfies

Sq = lnq Zq , (3.57)

while the corresponding thermodynamic equations are

∂βmlnq Zq = −〈〈um〉〉q , m = 1, . . . M , (3.58)

∂Sq

∂〈〈um〉〉q= βm , m = 1, . . . M , (3.59)

where

lnq Zq = lnq Zq −M∑

m=1

βm〈〈um〉〉q . (3.60)

3.5 Measure-Theoretic Definitions Revisited

It is well known that unlike Shannon entropy, Kullback-Leibler relative-entropy in

the discrete case can be extended naturally to the measure-theoretic case by a simple

63

Page 73: Thesis Entropy Very Impotant

limiting process cf. (Topsøe, 2001, Theorem 5.2). In this section, we show that this

fact is true for generalized relative-entropies too. Renyi relative-entropy on continuous

valued space R and its equivalence with the discrete case is studied by Renyi (1960),

Jizba and Arimitsu (2004b). Here, we present the result in the measure-theoretic case

and conclude that measure-theoretic definitions of both Tsallis and Renyi relative-

entropies are equivalent to their respective entities.

We also present a result pertaining to ME of measure-theoretic Tsallis entropy. We

prove that ME of Tsallis entropy in the measure-theoretic case is consistent with the

discrete case.

3.5.1 On Measure-Theoretic Definitions of Generalized Relative-Entropies

Here we show that generalized relative-entropies in the discrete case can be naturally

extended to measure-theoretic case, in the sense that measure-theoretic definitions can

be defined as limits of sequences of finite discrete entropies of pmfs which approximate

the pdfs involved. We refer to any such sequence of pmfs as “approximating sequence

of pmfs of a pdf”. To formalize these aspects we need the following lemma.

LEMMA 3.1 Let p be a pdf defined on measure space (X,M, µ). Then there exists a sequence

of simple functions fn (approximating sequence of simple functions of p) such

that limn→∞ fn = p and each fn can be written as

fn(x) =1

µ(En,k)

En,k

pdµ , ∀x ∈ En,k , k = 1, . . . m(n) , (3.61)

where En,km(n)k=1 , is the measurable partition of X corresponding to fn (the nota-

tion m(n) indicates that m varies with n). Further each fn satisfies∫

Xfn dµ = 1 . (3.62)

Proof Define a sequence of simple functions fn as

fn(x) =

1

µ(p−1([ k2n ,k+12n )))

p−1([ k2n ,k+12n ))

pdµ , if k2n ≤ p(x) < k+1

2n ,

k = 0, 1, . . . n2n − 1 ,

1

µ(p−1([n,∞)))

p−1([n,∞))pdµ , if n ≤ p(x) .

(3.63)

64

Page 74: Thesis Entropy Very Impotant

Each fn is indeed a simple function and can be written as

fn =

n2n−1∑

k=0

(1

µ(En,k)

En,k

pdµ

)χEn,k +

(1

µ(Fn)

Fn

pdµ

)χFn , (3.64)

where En,k = p−1([

k2n ,

k+12n

)), k = 0, . . . , n2n−1 and Fn = p−1 ([n,∞)). Also, for

any measurable set E ∈ M, χE : X → 0, 1 denotes its indicator or characteristic

function. Note that En,0, . . . , En,n2n−1, Fn is indeed a measurable partition of X ,

for any n. Since∫E pdµ < ∞ for any E ∈ M, we have

∫En,k

pdµ = 0 whenever

µ(En,k) = 0, for k = 0, . . . n2n − 1. Similarly∫Fnpdµ = 0 whenever µ(Fn) = 0.

Now we show that limn→∞ fn = p, point-wise.

Since p is a pdf, we have p(x) <∞. Then ∃n ∈ Z+ 3 p(x) ≤ n. Also ∃ k ∈ Z

+,

0 ≤ k ≤ n2n − 1 3 k2n ≤ p(x) < k+1

2n and k2n ≤ fn(x) <

k+12n . This implies

0 ≤ |p− fn| < 12n as required.

(Note that this lemmma holds true even if p is not a pdf. This follows from, if

p(x) = ∞, for some x ∈ X , then x ∈ Fn for all n, and therefore fn(x) ≥ n for all n;

hence limn→∞ fn(x) = ∞ = p(x).)

Finally, we have

Xfn dµ=

n(m)∑

k=1

[1

µ(En,k)

En,k

pdµ

]µ(En,k) +

[1

µ(Fn)

Fn

pdµ

]µ(Fn)

=

n(m)∑

k=1

En,k

pdµ+

Fn

pdµ

=

Xpdµ = 1 .

The above construction of a sequence of simple functions which approximate a

measurable function is similar to the approximation theorem (e.g., Kantorovitz, 2003,

pp.6, Theorem 1.8(b)) in the theory of integration. But, approximation in Lemma 3.1

can be seen as a mean-value approximation whereas in the above case it is the lower

approximation. Further, unlike in the case of lower approximation, the sequence of

simple functions which approximate p in Lemma 3.1 are neither monotone nor satisfy

fn ≤ p.

Now one can define a sequence of pmfs pn corresponding to the sequence of

simple functions constructed in Lemma 3.1, denoted by pn = (pn,1, . . . , pn,m(n)), as

pn,k = µ(En,k)(fnχEn,k(x)

)=

En,k

pdµ , k = 1, . . . m(n), (3.65)

65

Page 75: Thesis Entropy Very Impotant

for any n. Note that in (3.65) the function fnχEn,k is a constant function by the con-

struction (Lemma 3.1) of fn. We have

m(n)∑

k=1

pn,k =

m(n)∑

k=1

En,k

pdµ =

Xpdµ = 1 , (3.66)

and hence pn is indeed a pmf. We call pn as the approximating sequence of pmfs of

pdf p.

Now we present our main theorem, where we assume that p and r are bounded. The

assumption of boundedness of p and r simplifies the proof. However, the result can be

extended to an unbounded case. (See (Renyi, 1959) analysis of Shannon entropy and

relative-entropy on R in the unbounded case.)

THEOREM 3.1 Let p and r be pdfs, which are bounded and defined on a measure space (X,M, µ).

Let pn and rn be approximating sequences of pmfs of p and r respectively. Let

Iα denote the Renyi relative-entropy as in (3.22) and Iq denote the Tsallis relative-

entropy as in (3.27). Then

limn→∞

Iα(pn‖rn) = Iα(p‖r) (3.67)

and

limn→∞

Iq(pn‖rn) = Iq(p‖r) , (3.68)

respectively.

Proof It is enough to prove the result for either Tsallis or Renyi since each one of them is a

monotone and continuous functions of the other. Hence we write down the proof for

the case of Renyi and we use the entropic index α in the proof.

Corresponding to pdf p, let fn be the approximating sequence of simple func-

tions such that limn→∞ fn = p as in Lemma 3.1. Let gn be the approximating se-

quence of simple functions for r such that limn→∞ gn = r. Corresponding to simple

functions fn and gn there exists a common measurable partition6 En,1, . . . En,m(n)such that fn and gn can be written as

fn(x) =

m(n)∑

k=1

(an,k)χEn,k(x) , an,k ∈ R+, ∀k = 1, . . . m(n) , (3.69a)

6Let ϕ and φ be two simple functions defined on (X, ). Let E1, . . . En and F1, . . . , Fm be themeasurable partitions corresponding to ϕ and φ respectively. Then the collection defined as Ei∩Fj |i =1, . . . n, j = 1, . . .m is a common measurable partition for ϕ and φ.

66

Page 76: Thesis Entropy Very Impotant

gn(x) =

m(n)∑

k=1

(bn,k)χEn,k(x) , bn,k ∈ R+, ∀k = 1, . . . m(n) , (3.69b)

where χEn,k is the characteristic function of En,k, for k = 1, . . . m(n). By (3.69a)

and (3.69b), the approximating sequences of pmfspn = (pn,1, . . . , pn,m(n))

(n)

andrn = (rn,1, . . . , rn,m(n))

(n)

can be written as (see (3.65))

pn,k = an,kµ(En,k) , k = 1, . . . ,m(n) , (3.70a)

rn,k = bn,kµ(En,k) , k = 1, . . . ,m(n) . (3.70b)

Now Renyi relative-entropy for pn and rn can be written as

Sα(pn‖rn) =1

α− 1ln

m(n)∑

k=1

aαn,k

bα−1n,k

µ(En,k) . (3.71)

To prove limn→∞ Sα(pn‖rn) = Sα(p‖r) it is enough to show that

limn→∞

1

α− 1ln

X

fn(x)α

gn(x)α−1 dµ(x) =

1

α− 1ln

X

p(x)α

r(x)α−1 dµ(x) , (3.72)

since we have7

X

fn(x)α

gn(x)α−1 dµ(x) =

m(n)∑

k=1

aαn,k

bα−1n,k

µ(En,k) . (3.73)

Further, it is enough to prove that

limn→∞

Xhn(x)

αgn(x) dµ(x) =

X

p(x)α

r(x)α−1 dµ(x) , (3.74)

where hn is defined as hn(x) = fn(x)gn(x) .

Case 1: 0 < α < 1

7Note that simple functions (fn)α and (gn)α−1 can be written as

(fn)α(x) =

m(n)k=1

aαn,k χEn,k(x) , and

(gn)α−1(x) =

m(n)k=1

bα−1n,k χEn,k

(x) .

Further,

fαn

gα−1n

(x) =

m(n)k=1 aαn,kbα−1

n,k χEn,k(x) .

67

Page 77: Thesis Entropy Very Impotant

In this case, the Lebesgue dominated convergence theorem (Rudin, 1964, pp.26,

Theorem 1.34) gives that,

limn→∞

X

fαn

gα−1n

dµ =

X

rα−1dµ , (3.75)

and hence (3.68) follows.

Case 2: α > 1

We have hαngn → pα

rα−1 a.e. By Fatou’s Lemma (Rudin, 1964, pp.23, Theorem

1.28), we obtain that,

limn→∞

inf

Xhn(x)

αgn(x) dµ(x) ≥∫

X

p(x)α

r(x)α−1 dµ(x) . (3.76)

From the construction of fn and gn (Lemma 3.1) we have

hn(x)gn(x) =1

µ(En,i)

En,i

p(x)

r(x)r(x) dµ(x) , ∀x ∈ En,i . (3.77)

By Jensen’s inequality we get

hn(x)αgn(x) ≤

1

µ(En,i)

En,i

p(x)α

r(x)α−1dµ(x) , ∀x ∈ En,i . (3.78)

By (3.69a) and (3.69b) we can write (3.78) as

aαn,i

bα−1n,i

µ(En,i) ≤∫

En,i

p(x)α

r(x)α−1dµ(x) , ∀i = 1, . . . m(n) . (3.79)

By summing over on both sides of (3.79) we get

m(n)∑

i=1

aαn,i

bα−1n,i

µ(En,i) ≤m(n)∑

i=1

En,i

p(x)α

r(x)α−1dµ(x) . (3.80)

Now (3.80) is nothing but∫

Xhαn(x)gn(x) dµ(x) ≤

X

p(x)α

r(x)α−1dµ(x) , ∀n ,

and hence

supi≥n

Xhαi (x)gi(x) dµ(x) ≤

X

p(x)α

r(x)α−1dµ(x) , ∀n .

Finally we have

limn→∞

sup

Xhαn(x)gn(x) dµ(x) ≤

X

p(x)α

r(x)α−1dµ(x) . (3.81)

From (3.76) and (3.81) we get

limn→∞

X

fn(x)α

gn(x)α−1dµ(x) =

X

p(x)α

r(x)α−1dµ(x) , (3.82)

and hence (3.68) follows.

68

Page 78: Thesis Entropy Very Impotant

3.5.2 On ME of Measure-Theoretic Definition of Tsallis Entropy

With the shortcomings of Shannon entropy that it cannot be naturally extended to

a non-discrete case, we have observed that Shannon entropy in a measure-theoretic

framework can be used in ME-prescriptions consistently with the discrete case. One

can easily see that generalized information measures of Renyi and Tsallis too cannot

be extended naturally to measure-theoretic cases, i.e., measure-theoretic definitions are

not equivalent to their corresponding discrete cases in the sense that they cannot be de-

fined as limits of sequences of finite discrete entropies corresponding to pmfs defined

on measurable partitions which approximate the pdf. One can use the same counter

example we discussed in § 3.1.1. In this section, we show that the ME-prescriptions in

the measure-theoretic case are consistent with the discrete case.

Proceeding as in the case of measure-theoretic entropy in § 3.1.3, by specifying

probability measures µ and P in discrete case as in (3.14a) and (3.14b) respectively,

the measure-theoretic Tsallis entropy Sq(P ) (3.26) can be reduced to

Sq(P ) =n∑

k=1

Pk lnqµk

Pk. (3.83)

By (2.46) we get

Sq(P ) =

n∑

k=1

Pqk [lnq µk − lnq Pk] . (3.84)

When µ is a uniform distribution i.e., µk = 1n , ∀k = 1, . . . n we get

Sq(P ) = Snq (P ) − nq−1 lnq n

n∑

k=1

Pqk , (3.85)

where Snq (P ) denotes the Tsallis entropy of pmf P = (P1, . . . , Pn) (2.31), and Sq(P )

denotes the measure-theoretic Tsallis entropy (3.26) reduced to the discrete case, with

the probability measures µ and P specified as in (3.14a) and (3.14b) respectively.

Now we show that the quantity∑n

k=1 Pqk is constant in maximization of Sq(P )

with respect to the set of constraints (3.50).

The claim is that∫

Xp(x)q dµ(x) = (Zq)

1−q, (3.86)

which holds for Tsallis maximum entropy distribution (3.54) in general. This can be

shown here as follows. I From the maximum entropy distribution (3.54), we have

p(x)1−q =

1 − (1 − q)

(∫

Xp(x)q dµ(x)

)−1 M∑

m=1

βm

(um(x) − 〈〈um〉〉q

)

(Zq)1−q ,

69

Page 79: Thesis Entropy Very Impotant

which can be rearranged as

(Zq)1−qp(x) =

1 − (1 − q)

∑Mm=1 βm

(um(x) − 〈〈um〉〉q

)

∫X p(x)

q dµ(x)

p(x)q .

By integrating both sides in the above equation, and by using (3.50) we get (3.86). J

Now, (3.86) can be written in its discrete form as

n∑

k=1

Pqk

µq−1k

= (Zq)1−q

. (3.87)

When µ is the uniform distribution we get

n∑

k=1

Pqk = n1−q(Zq)

1−q, (3.88)

which is a constant.

Hence, by (3.85) and (3.88), one can conclude that with respect to a particular in-

stance of ME, measure-theoretic Tsallis entropy Sq(P ) defined for a probability mea-

sure P on the measure space (X,M, µ) is equal to discrete Tsallis entropy up to an

additive constant, when the reference measure µ is chosen as a uniform probability dis-

tribution. There by, one can further conclude that with respect to a particular instance

of ME, measure-theoretic Tsallis entropy is consistent with its discrete definition.

The same result can be shown in the case of q-expectation values too.

3.6 Gelfand-Yaglom-Perez Theorem in the General Case

The measure-theoretic definition of KL-entropy plays a basic role in the definitions

of classical information measures. Entropy, mutual information and conditional forms

of entropy can be expressed in terms of KL-entropy and hence properties of their

measure-theoretic analogs will follow from those of measure-theoretic KL-entropy

(Gray, 1990). These measure-theoretic definitions are key to extending the ergodic

theorems of information theory to non-discrete cases. A fundamental theorem in this

respect is the Gelfand-Yaglom-Perez (GYP) Theorem (Pinsker, 1960b, Theorem 2.4.2)

which states that measure-theoretic relative-entropy equals the supremum of relative-

entropies over all measurable partitions. In this section we prove the GYP-theorem for

Renyi relative-entropy of order greater than one.

Before we proceed to the definitions and present the notion of relative-entropy

on a measurable partition, we recall our notation and introduce new symbols. Let

70

Page 80: Thesis Entropy Very Impotant

(X,M) be a measurable space and Π denote the set of all measurable partitions of X .

We denote a measurable partition π ∈ Π as π = Ekmk=1, i.e, ∪mk=1Ek = X and

Ei ∩ Ej = ∅, i 6= j, i, j = 1, . . . m. We denote the set of all simple functions on

(X,M) by L+0 , and the set of all nonnegative M-measurable functions by L

+. The set

of all µ-integrable functions, where µ is a measure defined on (X,M), is denoted by

L1(µ). Renyi relative-entropy Iα(P‖R) refers to (3.23), which can be written as

Iα(P‖R) =1

α− 1ln

Xϕα dR , (3.89)

where ϕ ∈ L1(R) is defined as ϕ = dPdR .

Let P and R be two probability measures on (X,M) such that P R. Relative-

entropy of partition π ∈ Π for P with respect to R is defined as

IP‖R(π) =

m∑

k=1

P (Ek) lnP (Ek)

R(Ek). (3.90)

The GYP-theorem states that

I(P‖R) = supπ∈Π

IP‖R(π) , (3.91)

where I(P‖R) measure-theoretic KL-entropy is defined as in Definition 3.2. When

P is not absolutely continuous with respect to R, GYP-theorem assigns I(P‖R) =

+∞. The proof of GYP-theorem given by Dobrushin (Dobrushin, 1959) can be found

in (Pinsker, 1960b, pp. 23, Theorem 2.4.2) or in (Gray, 1990, pp. 92, Lemma 5.2.3).

Before we state and prove the GYP-theorem for Renyi relative-entropy of order

α > 1, we state the following lemma.

LEMMA 3.2 Let P and R be probability measures on the measurable space (X,M) such that

P R. Let ϕ = dPdR . Then for any E ∈ M and α > 1 we have

P (E)α

R(E)α−1≤∫

Eϕα dR . (3.92)

Proof Since P (E) =∫E ϕdR, ∀E ∈ M, by Holder’s inequality we have

EϕdR ≤

(∫

Eϕα dR

) 1α(∫

EdR

)1− 1α

.

That is

P (E)α ≤ R(E)α(1− 1α

)∫

Eϕα dR ,

and hence (3.92) follows.

71

Page 81: Thesis Entropy Very Impotant

We now present our main result in a special case as follows.

LEMMA 3.3 Let P and R be two probability measures such that P R. Let ϕ = dPdR ∈ L

+0 .

Then for any 0 < α <∞, we have

Iα(P‖R) =1

α− 1ln

m∑

k=1

P (Ek)α

R(Ek)α−1 , (3.93)

where Ekmk=1 ∈ Π is the measurable partition corresponding to ϕ.

Proof The simple function ϕ ∈ L+0 can be written as ϕ(x) =

∑mk=1 akχEk(x), ∀x ∈ X ,

where ak ∈ R, k = 1, . . . m. Now we have P (Ek) =∫EkϕdR = akR(Ek), and

hence

ak =P (Ek)

R(Ek), ∀k = 1, . . . m. (3.94)

We also have ϕα(x) =∑m

k=1 aαkχEk(x), ∀x ∈ X and hence

Xϕα dR =

m∑

k=1

aαkR(Ek) . (3.95)

Now, from (3.89), (3.94) and (3.95) one obtains (3.93).

Note that right hand side of (3.93) represents the Renyi relative-entropy of the

partition Ekmk=1 ∈ Π. Now we state and prove GYP-theorem for Renyi relative-

entropy.

THEOREM 3.2 Let (X,M) be a measurable space and Π denote the set of all measurable partitions

of X . Let P and R be two probability measures. Then for any α > 1, we have

Iα(P‖R) =

supEkmk=1∈Π

1

α− 1ln

m∑

k=1

P (Ek)α

R(Ek)α−1 if P R ,

+∞ otherwise.

(3.96)

Proof If P is not absolutely continuous with respectR, there existsE ∈ M such that P (E) >

0 and R(E) = 0. Since E,X −E ∈ Π, Iα(P‖R) = +∞.

Now, we assume that P R. It is clear that it is enough to prove that

Xϕα dR = sup

Ekmk=1∈Π

m∑

k=1

P (Ek)α

R(Ek)α−1 , (3.97)

72

Page 82: Thesis Entropy Very Impotant

where ϕ = dPdR . From Lemma 3.2, for any measurable partition Ekmk=1 ∈ Π, we

have

m∑

k=1

P (Ek)α

R(Ek)α−1 ≤

m∑

k=1

Ek

ϕα dR =

Xϕα dR ,

and hence

supEkmk=1∈Π

m∑

k=1

P (Ek)α

R(Ek)α−1 ≤

Xϕα dR . (3.98)

Now we shall obtain the reverse inequality to prove (3.97) . Thus, we now show

supEkmk=1∈Π

m∑

k=1

P (Ek)α

R(Ek)α−1 ≥

Xϕα dR . (3.99)

Note that corresponding to any ϕ ∈ L+, there exists a sequence of simple functions

ϕn, ϕn ∈ L+0 , that satisfy

0 ≤ ϕ1 ≤ ϕ2 ≤ . . . ≤ ϕ (3.100)

such that limn→∞ ϕn = ϕ (Kantorovitz, 2003, Theorem 1.8(2)). ϕn induces a

sequence of measures Pn on (X,M) defined by

Pn(E) =

Eϕn(x) dR(x) , ∀E ∈ M. (3.101)

We have∫E ϕn dR ≤

∫E ϕdR < ∞,∀E ∈ M and hence Pn R, ∀n. From the

Lebesgue bounded convergence theorem, we have

limn→∞

Pn(E) = P (E) , ∀E ∈ M . (3.102)

Now, ϕn ∈ L+0 , ϕαn ≤ ϕαn+1 ≤ ϕα, 1 ≤ n <∞ and limn→∞ ϕαn = ϕα for any α > 0.

Hence from Lebesgue monotone convergence theorem we have

limn→∞

Xϕαn dR =

Xϕα dR . (3.103)

We now claim that (3.103) implies

∫ϕα dR = sup

XφdR | 0 ≤ φ ≤ ϕα , φ ∈ L

+0

. (3.104)

This can be verified as follows. Denote φn = ϕαn . We have 0 ≤ φ ≤ ϕα, ∀n, φn ↑ ϕα,

and (as shown above)

limn→∞

Xφn dR =

Xϕα dR . (3.105)

73

Page 83: Thesis Entropy Very Impotant

Now for any φ ∈ L+0 such that 0 ≤ φ ≤ ϕα we have

XφdR ≤

Xϕα dR

and hence

sup

XφdR | 0 ≤ φ ≤ ϕα , φ ∈ L

+0

≤∫ϕα dR . (3.106)

Now we show the reverse inequality of (3.106). If∫X ϕ

α dR < +∞, from (3.105)

given any ε > 0 one can find 0 ≤ n0 <∞ such that∫

Xϕα dR <

Xφn0 dR+ ε

and hence∫

Xϕα dR < sup

XφdR | 0 ≤ φ ≤ ϕα , φ ∈ L

+0

+ ε . (3.107)

Since (3.107) is true for any ε > 0, we can write∫

Xϕα dR ≤ sup

XφdR | 0 ≤ φ ≤ ϕα , φ ∈ L

+0

. (3.108)

Now let us verify (3.108) in the case of∫X ϕ

α dR = +∞. In this case, ∀N > 0, one

can choose n0 such that∫X φn0 dR > N and hence

Xϕα dR > N (∵ 0 ≤ φn0 ≤ ϕα) (3.109)

and

sup

XφdR | 0 ≤ φ ≤ ϕα , φ ∈ L

+0

> N . (3.110)

Since (3.109) and (3.110) are true for any N > 0, we have∫

Xϕα dR = sup

XφdR | 0 ≤ φ ≤ ϕα , φ ∈ L

+0

= +∞ (3.111)

and hence (3.108) is verified in the case of∫X ϕ

α dR = +∞. Now (3.106) and (3.108)

verify the claim that (3.103) implies (3.104). Finally (3.104) together with Lemma 3.3

proves (3.97) and hence the theorem.

Now from the fact that Renyi and Tsallis relative-entropies ((3.23) and (3.28) re-

spectively) are monotone and continuous functions of each other, the GYP-theorem

presented in the case of Renyi is valid for the Tsallis case too, whenever q > 1.

However, the GYP-theorem is yet to be stated for the case when entropic index

0 < α < 1 ( 0 < q < 1 in the case of Tsallis). Work on this problem is ongoing.

74

Page 84: Thesis Entropy Very Impotant

4 Geometry and Entropies:Pythagoras’ Theorem

Abstract

Kullback-Leibler relative-entropy, in cases involving distributions resulting fromrelative-entropy minimization, has a celebrated property reminiscent of squared Eu-clidean distance: it satisfies an analogue of the Pythagoras’ theorem. And hence,this property is referred to as Pythagoras’ theorem of relative-entropy minimizationor triangle equality and plays a fundamental role in geometrical approaches of statis-tical estimation theory like information geometry. We state and prove the equivalentof Pythagoras’ theorem in the generalized nonextensive formalism as the main resultof this chapter. Before presenting this result we study the Tsallis relative-entropyminimization and present some differences with the classical case. This work canalso be found in (Dukkipati et al., 2005b; Dukkipati, Murty, & Bhatnagar, 2006a).

Apart from being a fundamental measure of information, Kullback-Leibler relative-

entropy or KL-entropy plays a role of ‘measure of the distance’ between two probabil-

ity distributions in statistics. Since it is not a metric, at first glance, it might seem that

the geometrical interpretations that metric distance measures provide usually might

not be possible at all with the KL-entropy playing a role as a distance measure on a

space of probability distributions. But it is a pleasant surprise that it is possible to for-

mulate certain geometric propositions for probability distributions, with the relative-

entropy playing the role of squared Euclidean distance. Some of these geometrical

interpretations cannot be derived from the properties of KL-entropy alone, but from

the properties of “KL-entropy minimization”; restating the previous statement, these

geometrical formulations are possible only when probability distributions resulting

from ME-prescriptions of KL-entropy are involved.

As demonstrated by Kullback (1959), minimization problems of relative-entropy

with respect to a set of moment constraints find their importance in the well known

Kullback’s minimum entropy principle and thereby play a basic role in the information-

theoretic approach to statistics (Good, 1963; Ireland & Kullback, 1968). They fre-

quently occur elsewhere also, e.g., in the theory of large deviations (Sanov, 1957), and

in statistical physics, as maximization of entropy (Jaynes, 1957a, 1957b).

Kullback’s minimum entropy principle can be considered as a general method of

75

Page 85: Thesis Entropy Very Impotant

inference about an unknown probability distribution when there exists a prior estimate

of the distribution and new information in the form of constraints on expected val-

ues (Shore, 1981b). Formally, one can state this principle as: given a prior distribution

r, of all the probability distributions that satisfy the given moment constraints, one

should choose the posterior p with the least relative-entropy. The prior distribution

r can be a reference distribution (uniform, Gaussian, Lorentzian or Boltzmann etc.)

or a prior estimate of p. The principle of Jaynes maximum entropy is a special case

of minimization of relative-entropy under appropriate conditions (Shore & Johnson,

1980).

Many properties of relative-entropy minimization just reflect well-known prop-

erties of relative-entropy but there are surprising differences as well. For example,

relative-entropy does not generally satisfy a triangle relation involving three arbitrary

probability distributions. But in certain important cases involving distributions that

result from relative-entropy minimization, relative-entropy results in a theorem com-

parable to the Pythagoras’ theorem cf. (Csiszar, 1975) and (Cencov, 1982, § 11). In

this geometrical interpretation, relative-entropy plays the role of squared distance and

minimization of relative-entropy appears as the analogue of projection on a sub-space

in a Euclidean geometry. This property is also known as triangle equality (Shore,

1981b).

The main aim of this chapter is to study the possible generalization of Pythagoras’

theorem to the nonextensive case. Before we take up this problem, we present the

properties of Tsallis relative-entropy minimization and present some differences with

the classical case. In the representation of such a minimum entropy distribution, we

highlight the use of the q-product (q-deformed version of multiplication), an operator

that has been introduced recently to derive the mathematical structure behind the Tsal-

lis statistics. Especially, q-product representation of Tsallis minimum relative-entropy

distribution will be useful for the derivation of the equivalent of triangle equality for

Tsallis relative-entropy.

Before we conclude this introduction on geometrical ideas of relative-entropy min-

imization, we make a note on the other geometric approaches that will not be consid-

ered in this thesis. One approach is that of Rao (1945), where one looks at the set of

probability distributions on a sample space as a differential manifold and introduce a

Riemannian geometry on this manifold. This approach is pioneered by Cencov (1982)

and Amari (1985) who have shown the existence of a particular Riemannian geometry

which is useful in understanding some questions of statistical inference. This Rieman-

nian geometry turns out to have some interesting connections with information theory

76

Page 86: Thesis Entropy Very Impotant

and as shown by Campbell (1985), with the minimum relative-entropy. In this ap-

proach too, the above mentioned Pythagoras’ Theorem plays an important role (Amari

& Nagaoka, 2000, pp.72).

The other idea involves the use of Hausdorff dimension (Billingsley, 1960, 1965)

to understand why minimizing relative-entropy should provide useful results. This

approach was begun by Eggleston (1952) for a special case of maximum entropy and

was developed by Campbell (1992). For an excellent review on various geometrical

aspects associated with minimum relative-entropy one can refer to (Campbell, 2003).

The structure of the chapter is organized as follows. We present the necessary

background in § 4.1, where we discuss properties of relative-entropy minimization in

the classical case. In § 4.2, we present the ME prescriptions of Tsallis relative-entropy

and discuss its differences with the classical case. Finally, the derivation of Pythagoras’

theorem in the nonextensive case is presented in § 4.3.

Regarding the notation, we use the same notation as in Chapter 3, and we write all

our mathematical formulations on the measure space (X,M, µ). All the assumptions

we made in Chapter 3 (see § 3.2) are valid here too. Also, though results presented in

this chapter do not involve major measure theoretic concepts, we write all the integrals

with respect to the measure µ, as a convention; these integrals can be replaced by

summations in the discrete case or Lebesgue integrals in the continuous case.

4.1 Relative-Entropy Minimization in the Classical Case

Kullback’s minimum entropy principle can stated formally as follows. Given a prior

distribution r with a finite set of moment constraints of the form∫

Xum(x)p(x) dµ(x) = 〈um〉 , m = 1, . . . ,M , (4.1)

one should choose the posterior p which minimizes the relative-entropy

I(p‖r) =

Xp(x) ln

p(x)

r(x)dµ(x) . (4.2)

In (4.1), 〈um〉, m = 1, . . . ,M are the known expectation values of M-measurable

functions um : X → R, m = 1, . . . ,M respectively.

With reference to (4.2) we clarify here that, though we mainly use expressions of

relative-entropy defined for pdfs in this chapter, we use expressions in terms of corre-

sponding probability measures as well. For example, when we write the Lagrangian

77

Page 87: Thesis Entropy Very Impotant

for relative-entropy minimization below, we use the definition of relative-entropy (3.7)

for probability measures P and R, corresponding to pdfs p and r respectively (refer to

Definitions 3.2 and 3.3). This correspondence between probability measures P and R

with pdfs p and r, respectively, will not be described again in the sequel.

4.1.1 Canonical Minimum Entropy Distribution

To minimize the relative-entropy (4.2) with respect to the constraints (4.1), the La-

grangian turns out to be

L(x, λ, β) =

Xln

dP

dR(x) dP (x) + λ

(∫

XdP (x) − 1

)

+M∑

m=1

βm

(∫

Xum(x) dP (x) − 〈um〉

), (4.3)

where λ and βm, m = 1, . . . M are Lagrange multipliers. The solution is given by

lndP

dR(x) + λ+

M∑

m=1

βmum(x) = 0 ,

and the solution can be written in the form of

dP

dR(x) =

e− Mm=1 βmum(x)

Xe−

Mm=1 βmum(x) dR

. (4.4)

Finally, from (4.4) the posterior distribution p(x) = dPdµ given by Kullback’s minimum

entropy principle can be written in terms of the prior r(x) = dRdµ as

p(x) =r(x)e−

Mm=1 βmum(x)

Z, (4.5)

where

Z =

Xr(x)e−

Mm=1 βmum(x) dµ(x) (4.6)

is the partition function.

Relative-entropy minimization has been applied to many problems in statistics (Kull-

back, 1959) and statistical mechanics (Hobson, 1971). The other applications include

pattern recognition (Shore & Gray, 1982), spectral analysis (Shore, 1981a), speech

coding (Markel & Gray, 1976), estimation of prior distribution for Bayesian infer-

ence (Caticha & Preuss, 2004) etc. For a list of references on applications of relative-

entropy minimization see (Shore & Johnson, 1980) and a recent paper (Cherney &

Maslov, 2004).

78

Page 88: Thesis Entropy Very Impotant

Properties of relative-entropy minimization have been studied extensively and pre-

sented by Shore (1981b). Here we briefly mention a few.

The principle of maximum entropy is equivalent to relative-entropy minimization

in the special case of discrete spaces and uniform priors, in the sense that, when the

prior is a uniform distribution with finite support W (over E ⊂ X), the minimum

entropy distribution turns out to be

p(x) =e−

Mm=1 βmum(x)

Ee−

Mm=1 βmum(x) dµ(x)

, (4.7)

which is in fact, a maximum entropy distribution (3.33) of Shannon entropy with re-

spect to the constraints (4.1).

The important relations to relative-entropy minimization are as follows. Minimum

relative-entropy, I , can be calculated as

I = − ln Z −M∑

m=1

βm〈um〉 , (4.8)

while the thermodynamic equations are

\[
\frac{\partial}{\partial \beta_m} \ln Z = -\langle u_m \rangle, \qquad m = 1, \ldots, M, \tag{4.9}
\]
and
\[
\frac{\partial I}{\partial \langle u_m \rangle} = -\beta_m, \qquad m = 1, \ldots, M. \tag{4.10}
\]
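The relations (4.8) and (4.9) are easy to check numerically in the discrete case; the following sketch (with assumed example values) compares both sides of (4.8) directly and verifies (4.9) by a central finite difference:

    import numpy as np

    r = np.array([0.5, 0.3, 0.2])
    u = np.array([0.0, 1.0, 2.0])

    def posterior(beta):
        w = r * np.exp(-beta * u)            # discrete form of (4.5)-(4.6)
        return w / w.sum(), w.sum()

    beta = 0.8
    p, Z = posterior(beta)
    u_mean = (u * p).sum()                   # <u> realized by this beta
    I = (p * np.log(p / r)).sum()            # relative-entropy I(p||r) of (4.2)

    print(np.isclose(I, -np.log(Z) - beta * u_mean))    # relation (4.8)

    eps = 1e-6
    dlogZ = (np.log(posterior(beta + eps)[1]) - np.log(posterior(beta - eps)[1])) / (2 * eps)
    print(np.isclose(dlogZ, -u_mean))                   # relation (4.9)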

4.1.2 Pythagoras’ Theorem

The statement of Pythagoras’ theorem of relative-entropy minimization can be formu-

lated as follows (Csiszar, 1975).

THEOREM 4.1 Let r be the prior, and let p be the probability distribution that minimizes the relative-entropy subject to a set of constraints
\[
\int_X u_m(x)\, p(x)\, d\mu(x) = \langle u_m \rangle, \qquad m = 1, \ldots, M, \tag{4.11}
\]
with respect to $M$-measurable functions $u_m : X \to \mathbb{R}$, $m = 1, \ldots, M$, whose expectation values $\langle u_m \rangle$, $m = 1, \ldots, M$, are (assumed to be) a priori known. Let l be any other distribution satisfying the same constraints (4.11). Then we have the triangle equality
\[
I(l\|r) = I(l\|p) + I(p\|r). \tag{4.12}
\]


Proof We have

\[
\begin{aligned}
I(l\|r) &= \int_X l(x) \ln \frac{l(x)}{r(x)}\, d\mu(x) \\
&= \int_X l(x) \ln \frac{l(x)}{p(x)}\, d\mu(x) + \int_X l(x) \ln \frac{p(x)}{r(x)}\, d\mu(x) \\
&= I(l\|p) + \int_X l(x) \ln \frac{p(x)}{r(x)}\, d\mu(x).
\end{aligned} \tag{4.13}
\]

From the minimum entropy distribution (4.5) we have

\[
\ln \frac{p(x)}{r(x)} = -\sum_{m=1}^{M} \beta_m u_m(x) - \ln Z. \tag{4.14}
\]

By substituting (4.14) in (4.13) we get

\[
\begin{aligned}
I(l\|r) &= I(l\|p) + \int_X l(x) \left( -\sum_{m=1}^{M} \beta_m u_m(x) - \ln Z \right) d\mu(x) \\
&= I(l\|p) - \sum_{m=1}^{M} \beta_m \int_X l(x)\, u_m(x)\, d\mu(x) - \ln Z \\
&= I(l\|p) - \sum_{m=1}^{M} \beta_m \langle u_m \rangle - \ln Z \qquad \text{(by hypothesis)} \\
&= I(l\|p) + I(p\|r). \qquad \text{(by (4.8))}
\end{aligned}
\]

A simple consequence of the above theorem is that

\[
I(l\|r) \geq I(p\|r), \tag{4.15}
\]

since I(l‖p) ≥ 0 for every pair of pdfs, with equality if and only if l = p. A pictorial

depiction of the triangle equality (4.12) is shown in Figure 4.1.

Figure 4.1: Triangle Equality of Relative-Entropy Minimization (the distributions r, p and l form the vertices of the triangle)
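The triangle equality can also be checked numerically; the sketch below (all specific values are assumptions made for illustration) constructs p by (4.5), builds another pmf l with the same expected value, and compares both sides of (4.12):

    import numpy as np

    r = np.array([0.5, 0.3, 0.2])
    u = np.array([0.0, 1.0, 2.0])
    beta = 0.8

    w = r * np.exp(-beta * u)
    p = w / w.sum()                              # minimum relative-entropy posterior (4.5)

    # Build l != p satisfying the same constraint: add a perturbation v with
    # sum(v) = 0 and sum(v*u) = 0, i.e. v in the null space of [1; u].
    v = np.cross(np.ones(3), u)                  # orthogonal to both rows on a 3-point space
    l = p + 0.05 * v / np.abs(v).max()           # small step keeps l a valid pmf here

    def I(a, b):                                 # discrete relative-entropy (4.2)
        return (a * np.log(a / b)).sum()

    print(np.isclose((l * u).sum(), (p * u).sum()))          # same expected value
    print(np.isclose(I(l, r), I(l, p) + I(p, r)))            # triangle equality (4.12)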

Detailed discussions on the importance of Pythagoras’ theorem of relative-entropy

minimization can be found in (Shore, 1981b) and (Amari & Nagaoka, 2000, pp. 72).


For a study of relative-entropy minimization without the use of Lagrange multiplier

technique and corresponding geometrical aspects, one can refer to (Csiszar, 1975).

Triangle equality of relative-entropy minimization not only plays a fundamental

role in geometrical approaches of statistical estimation theory (Cencov, 1982) and

information geometry (Amari, 1985, 2001) but is also important for applications in

which relative-entropy minimization is used for purposes of pattern classification and

cluster analysis (Shore & Gray, 1982).

4.2 Tsallis Relative-Entropy Minimization

Unlike for the generalized entropy measures, minimization of generalized relative-entropies is not
much addressed in the literature. Here, one has to mention the work of Borland et al.

(1998), where they give the minimum relative-entropy distribution of Tsallis relative-

entropy with respect to the constraints in terms of q-expectation values.

In this section, we study several aspects of Tsallis relative-entropy minimization.

First we derive the minimum entropy distribution in the case of q-expectation values

(3.38) and then in the case of normalized q-expectation values (3.50). We propose

an elegant representation of these distributions by using q-deformed binary operator

called q-product ⊗q. This operator is defined by Borges (2004) along similar lines

as q-addition ⊕q and q-subtraction q that we discussed in § 2.3.2. Since q-product

plays an important role in nonextensive formalism, we include a detailed discussion

on the q-product in this section. Finally, we study properties of Tsallis relative-entropy

minimization and its differences with the classical case.

4.2.1 Generalized Minimum Relative-Entropy Distribution

To minimize Tsallis relative-entropy

\[
I_q(p\|r) = -\int_X p(x) \ln_q \frac{r(x)}{p(x)}\, d\mu(x) \tag{4.16}
\]
with respect to the set of constraints specified in terms of q-expectation values
\[
\int_X u_m(x)\, p(x)^q\, d\mu(x) = \langle u_m \rangle_q, \qquad m = 1, \ldots, M, \tag{4.17}
\]

the concomitant variational principle is given as follows: Define

\[
L(x, \lambda, \beta) = \int_X \ln_q \frac{r(x)}{p(x)}\, dP(x)
- \lambda \left( \int_X dP(x) - 1 \right)
- \sum_{m=1}^{M} \beta_m \left( \int_X p(x)^{q-1} u_m(x)\, dP(x) - \langle u_m \rangle_q \right), \tag{4.18}
\]


where $\lambda$ and $\beta_m$, $m = 1, \ldots, M$, are Lagrange multipliers. Now set
\[
\frac{dL}{dP} = 0. \tag{4.19}
\]

The solution is given by

\[
\ln_q \frac{r(x)}{p(x)} - \lambda - p(x)^{q-1} \sum_{m=1}^{M} \beta_m u_m(x) = 0,
\]
which can be rearranged, using the definition of the q-logarithm $\ln_q x = \frac{x^{1-q}-1}{1-q}$, as
\[
p(x) = \frac{\left[ r(x)^{1-q} - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right]^{\frac{1}{1-q}}}{\left( \lambda(1-q) + 1 \right)^{\frac{1}{1-q}}}.
\]

Specifying the Lagrange parameter $\lambda$ via the normalization $\int_X p(x)\, d\mu(x) = 1$, one
can write the Tsallis minimum relative-entropy distribution as (Borland et al., 1998)
\[
p(x) = \frac{\left[ r(x)^{1-q} - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right]^{\frac{1}{1-q}}}{Z_q}, \tag{4.20}
\]

where the partition function is given by

\[
Z_q = \int_X \left[ r(x)^{1-q} - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right]^{\frac{1}{1-q}} d\mu(x). \tag{4.21}
\]

The values of the Lagrange parameters βm, m = 1, . . . ,M are determined using the

constraints (4.17).
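For concreteness, the following sketch (an illustration with assumed values, not part of the original text) evaluates (4.20)-(4.21) on a finite alphabet for given multipliers, setting $p(x) = 0$ where the bracket is negative (the Tsallis cut-off condition discussed later in this chapter):

    import numpy as np

    def tsallis_min_rel_entropy_dist(r, u, beta, q):
        """r: prior pmf (n,); u: constraint functions (M, n); beta: (M,); q != 1."""
        base = r**(1.0 - q) - (1.0 - q) * (beta @ u)
        positive = base > 0.0
        w = np.zeros_like(base)
        w[positive] = base[positive]**(1.0 / (1.0 - q))   # Tsallis cut-off: p = 0 elsewhere
        Zq = w.sum()                                      # partition function (4.21)
        return w / Zq, Zq

    # Assumed example values.
    r = np.array([0.5, 0.3, 0.2])
    u = np.array([[0.0, 1.0, 2.0]])
    beta = np.array([0.4])
    q = 1.2
    p, Zq = tsallis_min_rel_entropy_dist(r, u, beta, q)
    print(p, p.sum())
    print((p**q * u).sum(axis=1))                         # the q-expectations <u_m>_q of (4.17)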

4.2.2 q-Product Representation for Tsallis Minimum Entropy Distribution

Note that the generalized relative-entropy distribution (4.20) is not of the form of its

classical counterpart (4.5) even if we replace the exponential with the q-exponential.

But one can express (4.20) in a form similar to the classical case by invoking the q-deformed binary operation called the q-product.

In the framework of q-deformed functions and operators discussed in Chapter 2, a

new multiplication, called the q-product, defined as
\[
x \otimes_q y \equiv
\begin{cases}
\left( x^{1-q} + y^{1-q} - 1 \right)^{\frac{1}{1-q}} & \text{if } x, y > 0 \text{ and } x^{1-q} + y^{1-q} - 1 > 0, \\
0 & \text{otherwise},
\end{cases} \tag{4.22}
\]


was first introduced in (Nivanen et al., 2003) and explicitly defined in (Borges, 2004) so as to
satisfy the following equations:

\[
\ln_q (x \otimes_q y) = \ln_q x + \ln_q y, \tag{4.23}
\]
\[
e_q^x \otimes_q e_q^y = e_q^{x+y}. \tag{4.24}
\]

The q-product recovers the usual product in the limit $q \to 1$, i.e., $\lim_{q \to 1} (x \otimes_q y) = xy$.
The fundamental properties of the q-product $\otimes_q$ are almost the same as those of the usual
product, but the distributive law does not hold in general, i.e.,
\[
a(x \otimes_q y) \neq a x \otimes_q y \qquad (a, x, y \in \mathbb{R}).
\]

Further properties of the q-product can be found in (Nivanen et al., 2003; Borges,

2004).
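These q-deformed operations can be coded directly; the sketch below (an illustrative addition, with arbitrary values of q, x and y) implements $\ln_q$, $e_q^x$ and $\otimes_q$ and checks the identities (4.23) and (4.24) as well as the $q \to 1$ limit:

    import numpy as np

    def ln_q(x, q):
        return np.log(x) if q == 1.0 else (x**(1.0 - q) - 1.0) / (1.0 - q)

    def exp_q(x, q):
        if q == 1.0:
            return np.exp(x)
        base = 1.0 + (1.0 - q) * x
        return base**(1.0 / (1.0 - q)) if base > 0 else 0.0        # q-exponential cut-off

    def q_product(x, y, q):
        base = x**(1.0 - q) + y**(1.0 - q) - 1.0
        return base**(1.0 / (1.0 - q)) if (x > 0 and y > 0 and base > 0) else 0.0   # (4.22)

    q, x, y = 1.3, 2.0, 0.7
    print(np.isclose(ln_q(q_product(x, y, q), q), ln_q(x, q) + ln_q(y, q)))     # (4.23)
    print(np.isclose(q_product(exp_q(x, q), exp_q(y, q), q), exp_q(x + y, q)))  # (4.24)
    print(np.isclose(q_product(x, y, 1.0 + 1e-9), x * y))    # q -> 1 recovers the product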

One can check the mathematical validity of the q-product by recalling the expres-

sion of the exponential function $e^x$,
\[
e^x = \lim_{n \to \infty} \left( 1 + \frac{x}{n} \right)^n. \tag{4.25}
\]

Replacing the power on the right-hand side of (4.25) by the n-fold q-product $\otimes_q$,
\[
x^{\otimes_q n} = \underbrace{x \otimes_q \cdots \otimes_q x}_{n \text{ times}}, \tag{4.26}
\]
one can verify that (Suyari, 2004b)
\[
e_q^x = \lim_{n \to \infty} \left( 1 + \frac{x}{n} \right)^{\otimes_q n}. \tag{4.27}
\]

Further mathematical significance of q-product is demonstrated in (Suyari & Tsukada,

2005) by discovering the mathematical structure of statistics based on the Tsallis for-

malism: law of error, q-Stirling’s formula, q-multinomial coefficient and experimental

evidence of q-central limit theorem.

Now, one can verify the non-trivial fact that Tsallis minimum entropy distribution

(4.20) can be expressed as (Dukkipati, Murty, & Bhatnagar, 2005b),

\[
p(x) = \frac{r(x) \otimes_q e_q^{-\sum_{m=1}^{M} \beta_m u_m(x)}}{Z_q}, \tag{4.28}
\]
where
\[
Z_q = \int_X r(x) \otimes_q e_q^{-\sum_{m=1}^{M} \beta_m u_m(x)}\, d\mu(x). \tag{4.29}
\]


Later in this chapter we see that this representation is useful in establishing properties

of Tsallis relative-entropy minimization and corresponding thermodynamic equations.

It is important to note that the distribution in (4.20) could be a (local/global) min-
imum only if q > 0 and the Tsallis cut-off condition (3.46) specified by the Tsallis maxi-
mum entropy distribution is extended to the relative-entropy case, i.e., $p(x) = 0$ when-
ever $\left[ r(x)^{1-q} - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right] < 0$. The latter condition is also required
for the q-product representation of the generalized minimum entropy distribution.
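The equivalence of the two representations is easy to confirm numerically; the sketch below (assumed example values) computes the numerator of (4.20) directly and via the q-product form (4.28), under parameter values for which the cut-off is not triggered:

    import numpy as np

    def exp_q(x, q):
        return (1.0 + (1.0 - q) * x)**(1.0 / (1.0 - q))      # valid when 1 + (1-q)x > 0

    def q_product(x, y, q):
        return (x**(1.0 - q) + y**(1.0 - q) - 1.0)**(1.0 / (1.0 - q))   # x, y > 0 assumed

    r = np.array([0.5, 0.3, 0.2])
    u = np.array([0.0, 1.0, 2.0])
    beta, q = 0.4, 1.2

    direct = (r**(1 - q) - (1 - q) * beta * u)**(1.0 / (1.0 - q))   # numerator of (4.20)
    via_qproduct = q_product(r, exp_q(-beta * u, q), q)             # numerator of (4.28)

    print(np.allclose(direct, via_qproduct))
    print(direct / direct.sum())             # the same minimum entropy distribution either way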

4.2.3 Properties

As we mentioned earlier, in the classical case, that is, when q = 1, relative-entropy
minimization with a uniform distribution as the prior is equivalent to entropy maximiza-
tion. But in the nonextensive framework this is not true. Let r be the uniform
distribution with finite support W (over E $\subset$ X). Then, by (4.20), one can verify that
the probability distribution which minimizes the Tsallis relative-entropy is

\[
p(x) = \frac{\left[ \frac{1}{W^{1-q}} - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right]^{\frac{1}{1-q}}}{\displaystyle\int_E \left[ \frac{1}{W^{1-q}} - (1-q) \sum_{m=1}^{M} \beta_m u_m(x) \right]^{\frac{1}{1-q}} d\mu(x)},
\]

which can be written as

\[
p(x) = \frac{e_q^{-W^{q-1} \ln_q W - \sum_{m=1}^{M} \beta_m u_m(x)}}{\displaystyle\int_E e_q^{-W^{q-1} \ln_q W - \sum_{m=1}^{M} \beta_m u_m(x)}\, d\mu(x)} \tag{4.30}
\]
or
\[
p(x) = \frac{e_q^{-W^{1-q} \sum_{m=1}^{M} \beta_m u_m(x)}}{\displaystyle\int_E e_q^{-W^{1-q} \sum_{m=1}^{M} \beta_m u_m(x)}\, d\mu(x)}. \tag{4.31}
\]

By comparing (4.30) or (4.31) with the Tsallis maximum entropy distribution (3.44), one
can conclude (formally, one can verify this by the thermodynamic equations of Tsallis
entropy (3.37)) that minimizing relative-entropy is not equivalent¹ to maximizing
entropy when the prior is a uniform distribution. The key observation here is that W
appears in (4.31), unlike in (3.44).

¹For fixed q-expected values $\langle u_m \rangle_q$, the two distributions (4.31) and (3.44) are equal, but the values of the corresponding Lagrange multipliers are different when $q \neq 1$ (while in the classical case they remain the same). Further, (4.31) offers the relation between the Lagrange parameters in these two cases. Let $\beta^{(S)}_m$, $m = 1, \ldots, M$, be the Lagrange parameters corresponding to the generalized maximum entropy distribution, while $\beta^{(I)}_m$, $m = 1, \ldots, M$, correspond to the generalized minimum entropy distribution with a uniform prior. Then we have the relation $\beta^{(S)}_m = W^{1-q} \beta^{(I)}_m$, $m = 1, \ldots, M$.

In this case, one can calculate the minimum relative-entropy $I_q$ as
\[
I_q = -\ln_q Z_q - \sum_{m=1}^{M} \beta_m \langle u_m \rangle_q. \tag{4.32}
\]

To demonstrate the usefulness of the q-product representation of the generalized minimum
entropy distribution, we present the verification of (4.32). By using the property of
q-multiplication (4.24), the Tsallis minimum relative-entropy distribution (4.28) can be
written as
\[
p(x)\, Z_q = e_q^{-\sum_{m=1}^{M} \beta_m u_m(x) + \ln_q r(x)}.
\]

By taking the q-logarithm on both sides, we get
\[
\ln_q p(x) + \ln_q Z_q + (1-q) \ln_q p(x) \ln_q Z_q = -\sum_{m=1}^{M} \beta_m u_m(x) + \ln_q r(x).
\]

By the property of the q-logarithm $\ln_q \left( \frac{x}{y} \right) = y^{q-1} (\ln_q x - \ln_q y)$, we have
\[
\ln_q \frac{r(x)}{p(x)} = p(x)^{q-1} \left[ \ln_q Z_q + (1-q) \ln_q p(x) \ln_q Z_q + \sum_{m=1}^{M} \beta_m u_m(x) \right]. \tag{4.33}
\]

By substituting (4.33) in the Tsallis relative-entropy (4.16) we get
\[
I_q = -\int_X p(x)^q \left[ \ln_q Z_q + (1-q) \ln_q p(x) \ln_q Z_q + \sum_{m=1}^{M} \beta_m u_m(x) \right] d\mu(x).
\]

By (4.17) and expanding $\ln_q p(x)$, one can write $I_q$ in its final form as in (4.32).
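The relation (4.32) can also be verified numerically; in the sketch below (assumed example values) the q-expectation $\langle u \rangle_q$ is the one realized by the distribution (4.20) itself, so the constraint (4.17) holds by construction:

    import numpy as np

    def ln_q(x, q):
        return (x**(1.0 - q) - 1.0) / (1.0 - q)

    r = np.array([0.5, 0.3, 0.2])
    u = np.array([0.0, 1.0, 2.0])
    beta, q = 0.4, 1.2

    w = (r**(1 - q) - (1 - q) * beta * u)**(1.0 / (1.0 - q))
    Zq = w.sum()
    p = w / Zq                                   # Tsallis minimum entropy distribution (4.20)

    u_q = (p**q * u).sum()                       # q-expectation <u>_q realized by this beta
    Iq = -(p * ln_q(r / p, q)).sum()             # Tsallis relative-entropy (4.16)

    print(np.isclose(Iq, -ln_q(Zq, q) - beta * u_q))   # relation (4.32)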

It is easy to verify the following thermodynamic equations for the minimum Tsallis

relative-entropy:

\[
\frac{\partial}{\partial \beta_m} \ln_q Z_q = -\langle u_m \rangle_q, \qquad m = 1, \ldots, M, \tag{4.34}
\]
\[
\frac{\partial I_q}{\partial \langle u_m \rangle_q} = -\beta_m, \qquad m = 1, \ldots, M, \tag{4.35}
\]

which generalize thermodynamic equations in the classical case.


4.2.4 The Case of Normalized q-Expectations

In this section we discuss Tsallis relative-entropy minimization with respect to the
constraints in the form of normalized q-expectations
\[
\frac{\int_X u_m(x)\, p(x)^q\, d\mu(x)}{\int_X p(x)^q\, d\mu(x)} = \langle\langle u_m \rangle\rangle_q, \qquad m = 1, \ldots, M. \tag{4.36}
\]

The variational principle for Tsallis relative-entropy minimization in this case is as

below. Let
\[
L(x, \lambda, \beta) = \int_X \ln_q \frac{r(x)}{p(x)}\, dP(x)
- \lambda \left( \int_X dP(x) - 1 \right)
- \sum_{m=1}^{M} \beta_m^{(q)} \left( \int_X p(x)^{q-1} \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right) dP(x) \right), \tag{4.37}
\]

where the parameters $\beta_m^{(q)}$ can be defined in terms of the true Lagrange parameters
$\beta_m$ as
\[
\beta_m^{(q)} = \frac{\beta_m}{\int_X p(x)^q\, d\mu(x)}, \qquad m = 1, \ldots, M. \tag{4.38}
\]

This gives the minimum entropy distribution as
\[
p(x) = \frac{1}{Z_q} \left[ r(x)^{1-q} - (1-q)\, \frac{\sum_{m=1}^{M} \beta_m \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right)}{\int_X p(x)^q\, d\mu(x)} \right]^{\frac{1}{1-q}}, \tag{4.39}
\]
where
\[
Z_q = \int_X \left[ r(x)^{1-q} - (1-q)\, \frac{\sum_{m=1}^{M} \beta_m \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right)}{\int_X p(x)^q\, d\mu(x)} \right]^{\frac{1}{1-q}} d\mu(x).
\]

Now, the minimum entropy distribution (4.39) can be expressed using the q-product
(4.22) as
\[
p(x) = \frac{1}{Z_q}\, r(x) \otimes_q \exp_q \left( -\frac{\sum_{m=1}^{M} \beta_m \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right)}{\int_X p(x)^q\, d\mu(x)} \right). \tag{4.40}
\]

The minimum Tsallis relative-entropy $I_q$ in this case satisfies
\[
I_q = -\ln_q Z_q, \tag{4.41}
\]
while one can derive the following thermodynamic equations:
\[
\frac{\partial}{\partial \beta_m} \ln_q \widetilde{Z}_q = -\langle\langle u_m \rangle\rangle_q, \qquad m = 1, \ldots, M, \tag{4.42}
\]


\[
\frac{\partial I_q}{\partial \langle\langle u_m \rangle\rangle_q} = -\beta_m, \qquad m = 1, \ldots, M, \tag{4.43}
\]
where
\[
\ln_q \widetilde{Z}_q = \ln_q Z_q - \sum_{m=1}^{M} \beta_m \langle\langle u_m \rangle\rangle_q. \tag{4.44}
\]

4.3 Nonextensive Pythagoras’ Theorem

With the above study of Tsallis relative-entropy minimization, in this section, we

present our main result, Pythagoras’ theorem or triangle equality (Theorem 4.1) gener-

alized to the nonextensive case. Before presenting this result, we discuss the significance
of the triangle equality in the classical case and restate Theorem 4.1, which is essential for
the derivation of the triangle equality in the nonextensive framework.

4.3.1 Pythagoras’ Theorem Restated

The significance of the triangle equality is evident in the following scenario. Let r be the
prior estimate of the unknown probability distribution l, about which information in
the form of constraints
\[
\int_X u_m(x)\, l(x)\, d\mu(x) = \langle u_m \rangle, \qquad m = 1, \ldots, M, \tag{4.45}
\]
is available with respect to the fixed functions $u_m$, $m = 1, \ldots, M$. The problem

is to choose a posterior estimate p that is, in some sense, the best estimate of l given
the available information, i.e., the prior r and the information in the form of expected
values (4.45). Kullback's minimum entropy principle provides a general solution to
this inference problem: it gives the estimate (4.5) when we minimize the relative-
entropy I(p\|r) with respect to the constraints
\[
\int_X u_m(x)\, p(x)\, d\mu(x) = \langle u_m \rangle, \qquad m = 1, \ldots, M. \tag{4.46}
\]

This estimate of the posterior p by Kullback's minimum entropy principle also offers
the relation (Theorem 4.1)
\[
I(l\|r) = I(l\|p) + I(p\|r), \tag{4.47}
\]

from which one can draw the following conclusions. By (4.15), the minimum relative-
entropy posterior estimate of l is not only logically consistent, but also closer to l, in the
relative-entropy sense, than is the prior r. Moreover, the difference $I(l\|r) - I(l\|p)$ is
exactly the relative-entropy $I(p\|r)$ between the posterior and the prior. Hence, $I(p\|r)$
can be interpreted as the amount of information provided by the constraints that is not
inherent in r.

Additional justification for using the minimum relative-entropy estimate of p with respect
to the constraints (4.46) is provided by the following expected value matching prop-
erty (Shore, 1981b). To explain this concept we restate our above estimation problem
as follows.

For fixed functions $u_m$, $m = 1, \ldots, M$, let the actual unknown distribution l satisfy
\[
\int_X u_m(x)\, l(x)\, d\mu(x) = \langle w_m \rangle, \qquad m = 1, \ldots, M, \tag{4.48}
\]

where $\langle w_m \rangle$, $m = 1, \ldots, M$, are expected values of l, the only information available
about l apart from the prior r. To apply the minimum entropy principle to obtain the poste-
rior estimate p of l, one has to determine the constraints for p with respect to which
we minimize I(p\|r). Equivalently, by assuming that p satisfies constraints of the
form (4.46), one has to determine the expected values $\langle u_m \rangle$, $m = 1, \ldots, M$.

Now, as $\langle u_m \rangle$, $m = 1, \ldots, M$, vary, one can show that $I(l\|p)$ attains its minimum
value when
\[
\langle u_m \rangle = \langle w_m \rangle, \qquad m = 1, \ldots, M. \tag{4.49}
\]

The proof is as follows (Shore, 1981b). Proceeding as in the proof of Theorem 4.1,
we have
\[
\begin{aligned}
I(l\|p) &= I(l\|r) + \sum_{m=1}^{M} \beta_m \int_X l(x)\, u_m(x)\, d\mu(x) + \ln Z \\
&= I(l\|r) + \sum_{m=1}^{M} \beta_m \langle w_m \rangle + \ln Z. \qquad \text{(by (4.48))}
\end{aligned} \tag{4.50}
\]

Since a variation of I(l\|p) with respect to $\langle u_m \rangle$ is equivalent to a variation of I(l\|p)
with respect to $\beta_m$ for any $m = 1, \ldots, M$, to find the minimum of I(l\|p) one can
solve
\[
\frac{\partial}{\partial \beta_m} I(l\|p) = 0, \qquad m = 1, \ldots, M,
\]
which gives the solution as in (4.49).

This property of expectation matching states that, for a distribution p of the form

(4.5), I(l‖p) is the smallest when the expected values of p match those of l. In partic-

ular, p is not only the distribution that minimizes I(p\|r), but also the one that minimizes I(l\|p).


We now restate Theorem 4.1, which summarizes the above discussion.

THEOREM 4.2 Let r be the prior distribution, and let p be the probability distribution that minimizes
the relative-entropy subject to a set of constraints
\[
\int_X u_m(x)\, p(x)\, d\mu(x) = \langle u_m \rangle, \qquad m = 1, \ldots, M. \tag{4.51}
\]
Let l be any other distribution satisfying the constraints
\[
\int_X u_m(x)\, l(x)\, d\mu(x) = \langle w_m \rangle, \qquad m = 1, \ldots, M. \tag{4.52}
\]
Then

1. $I(l\|p)$ is minimum only if (expectation matching property)
\[
\langle u_m \rangle = \langle w_m \rangle, \qquad m = 1, \ldots, M. \tag{4.53}
\]

2. When (4.53) holds, we have
\[
I(l\|r) = I(l\|p) + I(p\|r). \tag{4.54}
\]

By the above interpretation of the triangle equality and analogy with the compara-
ble situation in Euclidean geometry, it is natural to call p, as defined by (4.5), the
projection of r on the plane described by (4.52). Csiszar (1975) has introduced a gen-
eralization of this notion to define the projection of r on any convex set $\mathcal{E}$ of probability
distributions. If $p \in \mathcal{E}$ satisfies the equation
\[
I(p\|r) = \min_{s \in \mathcal{E}} I(s\|r), \tag{4.55}
\]
then p is called the projection of r on $\mathcal{E}$. Csiszar (1975) develops a number of results
about these projections for both finite and infinite dimensional spaces. In this thesis,
we will not consider this general approach.

4.3.2 The Case of q-Expectations

From the above discussion, it is clear that to derive the triangle equality of Tsallis

relative-entropy minimization, one should first deduce the equivalent of expectation

matching property in the nonextensive case.

We state below and prove the Pythagoras theorem in the nonextensive framework

(Dukkipati, Murty, & Bhatnagar, 2006a).


THEOREM 4.3 Let r be the prior distribution, and let p be the probability distribution that minimizes
the Tsallis relative-entropy subject to a set of constraints
\[
\int_X u_m(x)\, p(x)^q\, d\mu(x) = \langle u_m \rangle_q, \qquad m = 1, \ldots, M. \tag{4.56}
\]
Let l be any other distribution satisfying the constraints
\[
\int_X u_m(x)\, l(x)^q\, d\mu(x) = \langle w_m \rangle_q, \qquad m = 1, \ldots, M. \tag{4.57}
\]
Then

1. $I_q(l\|p)$ is minimum only if
\[
\langle u_m \rangle_q = \frac{\langle w_m \rangle_q}{1 - (1-q) I_q(l\|p)}, \qquad m = 1, \ldots, M. \tag{4.58}
\]

2. Under (4.58), we have
\[
I_q(l\|r) = I_q(l\|p) + I_q(p\|r) + (q-1) I_q(l\|p) I_q(p\|r). \tag{4.59}
\]

Proof First we deduce the equivalent of the expectation matching property in the nonextensive
case. That is, we would like to find the values of $\langle u_m \rangle_q$ for which $I_q(l\|p)$ is minimum.
We write the following useful relations before we proceed to the derivation.

We can write the generalized minimum entropy distribution (4.28) as
\[
p(x) = \frac{e_q^{\ln_q r(x)} \otimes_q e_q^{-\sum_{m=1}^{M} \beta_m u_m(x)}}{Z_q}
= \frac{e_q^{-\sum_{m=1}^{M} \beta_m u_m(x) + \ln_q r(x)}}{Z_q}, \tag{4.60}
\]
by using the relations $e_q^{\ln_q x} = x$ and $e_q^x \otimes_q e_q^y = e_q^{x+y}$. Further, by using
\[
\ln_q (xy) = \ln_q x + \ln_q y + (1-q) \ln_q x \ln_q y
\]
we can write (4.60) as
\[
\ln_q p(x) + \ln_q Z_q + (1-q) \ln_q p(x) \ln_q Z_q = -\sum_{m=1}^{M} \beta_m u_m(x) + \ln_q r(x). \tag{4.61}
\]

By the property of the q-logarithm
\[
\ln_q \left( \frac{x}{y} \right) = y^{q-1} (\ln_q x - \ln_q y), \tag{4.62}
\]
and by the q-logarithmic representation of Tsallis entropy,
\[
S_q = -\int_X p(x)^q \ln_q p(x)\, d\mu(x),
\]
one can verify that
\[
I_q(p\|r) = -\int_X p(x)^q \ln_q r(x)\, d\mu(x) - S_q(p). \tag{4.63}
\]

With these relations in hand we proceed with the derivation. Consider
\[
I_q(l\|p) = -\int_X l(x) \ln_q \frac{p(x)}{l(x)}\, d\mu(x).
\]
By (4.62) we have
\[
\begin{aligned}
I_q(l\|p) &= -\int_X l(x)^q \left[ \ln_q p(x) - \ln_q l(x) \right] d\mu(x) \\
&= I_q(l\|r) - \int_X l(x)^q \left[ \ln_q p(x) - \ln_q r(x) \right] d\mu(x).
\end{aligned} \tag{4.64}
\]

From (4.61), we get
\[
\begin{aligned}
I_q(l\|p) = I_q(l\|r) &+ \int_X l(x)^q \left[ \sum_{m=1}^{M} \beta_m u_m(x) \right] d\mu(x)
+ \ln_q Z_q \int_X l(x)^q\, d\mu(x) \\
&+ (1-q) \ln_q Z_q \int_X l(x)^q \ln_q p(x)\, d\mu(x).
\end{aligned} \tag{4.65}
\]

By using (4.57) and (4.63),
\[
\begin{aligned}
I_q(l\|p) = I_q(l\|r) &+ \sum_{m=1}^{M} \beta_m \langle w_m \rangle_q + \ln_q Z_q \int_X l(x)^q\, d\mu(x) \\
&+ (1-q) \ln_q Z_q \left[ -I_q(l\|p) - S_q(l) \right],
\end{aligned} \tag{4.66}
\]

and by the expression for Tsallis entropy $S_q(l) = \frac{1}{q-1} \left[ 1 - \int_X l(x)^q\, d\mu(x) \right]$, we have
\[
I_q(l\|p) = I_q(l\|r) + \sum_{m=1}^{M} \beta_m \langle w_m \rangle_q + \ln_q Z_q - (1-q) \ln_q Z_q\, I_q(l\|p). \tag{4.67}
\]

Since the multipliers $\beta_m$, $m = 1, \ldots, M$, are functions of the expected values $\langle u_m \rangle_q$,
variations in the expected values are equivalent to variations in the multipliers. Hence,
to find the minimum of $I_q(l\|p)$, we solve
\[
\frac{\partial}{\partial \beta_m} I_q(l\|p) = 0. \tag{4.68}
\]
By using the thermodynamic equation (4.34), the solution of (4.68) provides us with the
expectation matching property in the nonextensive case as
\[
\langle u_m \rangle_q = \frac{\langle w_m \rangle_q}{1 - (1-q) I_q(l\|p)}, \qquad m = 1, \ldots, M. \tag{4.69}
\]


In the limit $q \to 1$ the above equation gives $\langle u_m \rangle_1 = \langle w_m \rangle_1$, which is the expectation
matching property in the classical case.

Now, to derive the triangle equality for Tsallis relative-entropy minimization, we
substitute the expression for $\langle w_m \rangle_q$, which is given by (4.69), in (4.67). After
some algebra one arrives at (4.59).

Note that the limit q → 1 in (4.59) gives the triangle equality in the classical

case (4.54). The two important cases which arise out of (4.59) are
\[
I_q(l\|r) \leq I_q(l\|p) + I_q(p\|r) \quad \text{when } 0 < q \leq 1, \tag{4.70}
\]
\[
I_q(l\|r) \geq I_q(l\|p) + I_q(p\|r) \quad \text{when } q > 1. \tag{4.71}
\]

We refer to Theorem 4.3 as the nonextensive Pythagoras' theorem and to (4.59) as the nonex-
tensive triangle equality, whose pseudo-additivity property is consistent with the pseudo-
additivity of Tsallis relative-entropy (compare (2.40) and (2.11)); hence it is a natural
generalization of the triangle equality in the classical case.
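The content of Theorem 4.3 can be illustrated numerically. In the sketch below (an illustration with assumed values; it relies on SciPy for one-dimensional minimization) we fix r, u and l, minimize $I_q(l\|p(\beta))$ over the multiplier $\beta$, and check that the minimizer satisfies (4.58) and the nonextensive triangle equality (4.59):

    import numpy as np
    from scipy.optimize import minimize_scalar

    q = 1.2
    r = np.array([0.5, 0.3, 0.2])
    u = np.array([0.0, 1.0, 2.0])
    l = np.array([0.2, 0.5, 0.3])

    def ln_q(x):
        return (x**(1.0 - q) - 1.0) / (1.0 - q)

    def I_q(a, b):                                    # Tsallis relative-entropy (4.16)
        return -(a * ln_q(b / a)).sum()

    def posterior(beta):                              # minimum entropy distribution (4.20)
        w = (r**(1.0 - q) - (1.0 - q) * beta * u)**(1.0 / (1.0 - q))
        return w / w.sum()

    res = minimize_scalar(lambda b: I_q(l, posterior(b)),
                          bounds=(-2.0, 4.0), method='bounded',
                          options={'xatol': 1e-10})
    p = posterior(res.x)

    u_q = (p**q * u).sum()                            # <u>_q of p, constraint (4.56)
    w_q = (l**q * u).sum()                            # <w>_q of l, constraint (4.57)
    Ilp, Ipr, Ilr = I_q(l, p), I_q(p, r), I_q(l, r)

    print(np.isclose(u_q, w_q / (1.0 - (1.0 - q) * Ilp), atol=1e-6))        # (4.58)
    print(np.isclose(Ilr, Ilp + Ipr + (q - 1.0) * Ilp * Ipr, atol=1e-6))    # (4.59)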

4.3.3 In the Case of Normalized q-Expectations

In the case of normalized q-expectations too, the Tsallis relative-entropy satisfies a nonex-
tensive triangle equality, but under conditions that differ from those in the case of q-expectation
values.

THEOREM 4.4 Let r be the prior distribution, and let p be the probability distribution that minimizes
the Tsallis relative-entropy subject to the set of constraints
\[
\frac{\int_X u_m(x)\, p(x)^q\, d\mu(x)}{\int_X p(x)^q\, d\mu(x)} = \langle\langle u_m \rangle\rangle_q, \qquad m = 1, \ldots, M. \tag{4.72}
\]
Let l be any other distribution satisfying the constraints
\[
\frac{\int_X u_m(x)\, l(x)^q\, d\mu(x)}{\int_X l(x)^q\, d\mu(x)} = \langle\langle w_m \rangle\rangle_q, \qquad m = 1, \ldots, M. \tag{4.73}
\]
Then we have
\[
I_q(l\|r) = I_q(l\|p) + I_q(p\|r) + (q-1) I_q(l\|p) I_q(p\|r), \tag{4.74}
\]
provided
\[
\langle\langle u_m \rangle\rangle_q = \langle\langle w_m \rangle\rangle_q, \qquad m = 1, \ldots, M. \tag{4.75}
\]


Proof From the Tsallis minimum entropy distribution p in the case of normalized q-expected
values (4.40), we have
\[
\ln_q r(x) - \ln_q p(x) = \ln_q Z_q + (1-q) \ln_q p(x) \ln_q Z_q
+ \frac{\sum_{m=1}^{M} \beta_m \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right)}{\int_X p(x)^q\, d\mu(x)}. \tag{4.76}
\]

Proceeding as in the proof of Theorem 4.3, we have

\[
I_q(l\|p) = I_q(l\|r) - \int_X l(x)^q \left[ \ln_q p(x) - \ln_q r(x) \right] d\mu(x). \tag{4.77}
\]

From (4.76), we obtain
\[
\begin{aligned}
I_q(l\|p) = I_q(l\|r) &+ \ln_q Z_q \int_X l(x)^q\, d\mu(x)
+ (1-q) \ln_q Z_q \int_X l(x)^q \ln_q p(x)\, d\mu(x) \\
&+ \frac{1}{\int_X p(x)^q\, d\mu(x)} \sum_{m=1}^{M} \beta_m \int_X l(x)^q \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right) d\mu(x).
\end{aligned} \tag{4.78}
\]

By (4.73) the same can be written as
\[
\begin{aligned}
I_q(l\|p) = I_q(l\|r) &+ \ln_q Z_q \int_X l(x)^q\, d\mu(x)
+ (1-q) \ln_q Z_q \int_X l(x)^q \ln_q p(x)\, d\mu(x) \\
&+ \frac{\int_X l(x)^q\, d\mu(x)}{\int_X p(x)^q\, d\mu(x)} \sum_{m=1}^{M} \beta_m \left( \langle\langle w_m \rangle\rangle_q - \langle\langle u_m \rangle\rangle_q \right).
\end{aligned} \tag{4.79}
\]

By using the relations
\[
\int_X l(x)^q \ln_q p(x)\, d\mu(x) = -I_q(l\|p) - S_q(l)
\]
and
\[
\int_X l(x)^q\, d\mu(x) = (1-q) S_q(l) + 1,
\]
(4.79) can be written as
\[
\begin{aligned}
I_q(l\|p) = I_q(l\|r) &+ \ln_q Z_q - (1-q) \ln_q Z_q\, I_q(l\|p) \\
&+ \frac{\int_X l(x)^q\, d\mu(x)}{\int_X p(x)^q\, d\mu(x)} \sum_{m=1}^{M} \beta_m \left( \langle\langle w_m \rangle\rangle_q - \langle\langle u_m \rangle\rangle_q \right).
\end{aligned} \tag{4.80}
\]

Finally using (4.41) and (4.75) we have the nonextensive triangle equality (4.74).


Note that in this case the minimum of Iq(l‖p) is not guaranteed. Also the condition

(4.75) for nonextensive triangle equality here is the same as the expectation value

matching property in the classical case.

Finally, nonextensive Pythagoras’ theorem is yet another remarkable and consis-

tent generalization shown by the Tsallis formalism.


5 Power-laws and Entropies: Generalization of Boltzmann Selection

Abstract

The great success of the Tsallis formalism is due to the power-law distributions resulting from ME-prescriptions of its entropy functional. In this chapter we provide an experimental demonstration of the use of power-law distributions in evolutionary algorithms by generalizing Boltzmann selection to the Tsallis case. The proposed algorithm uses the Tsallis canonical distribution to weigh the configurations for 'selection' instead of the Gibbs-Boltzmann distribution. This work is motivated by the recently proposed generalized simulated annealing algorithm based on Tsallis statistics. The results in this chapter can also be found in (Dukkipati, Murty, & Bhatnagar, 2005a).

The central step of an enormous variety of problems (in Physics, Chemistry, Statistics,

Engineering, Economics) is the minimization of an appropriate energy or cost func-

tion. (For example, energy function in the traveling salesman problem is the length

of the path.) If the cost function is convex, any gradient descent method easily solves

the problem. But if the cost function is nonconvex the solution requires more sophis-

ticated methods, since a gradient descent procedure could easily trap the system in a

local minimum. Consequently, various algorithmic strategies have been developed

along the years for making this important problem increasingly tractable. Among

the various methods developed to solve hard optimization problems, the most popular

ones are simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) and evolutionary

algorithms (Bounds, 1987).

Evolutionary computation comprises techniques for obtaining near-optimal so-

lutions of hard optimization problems in physics (e.g., Sutton, Hunter, & Jan, 1994)

and engineering (Holland, 1975). These methods are based largely on ideas from bio-

logical evolution and are similar to simulated annealing, except that, instead of explor-

ing the search space with a single point at each instant, these deal with a population – a

multi-subset of search space – in order to avoid getting trapped in local optima during

the process of optimization. Though evolutionary algorithms are not analyzed tradi-

tionally in the Monte Carlo framework, a few researchers (e.g., Cercueil & Francois,

2001; Cerf, 1996a, 1996b) analyzed these algorithms in this framework.


A typical evolutionary algorithm is a two-step process: selection and variation.

Selection comprises replicating an individual in the population based on probabilities

(selection probabilities) that are assigned to individuals in the population on the basis

of a “fitness” measure defined by the objective function. A stochastic perturbation of

individuals while replicating is called variation.

Selection is a central concept in evolutionary algorithms. There are several selec-

tion mechanisms in evolutionary algorithms, among which Boltzmann selection has an

important place because of the deep connection between the behavior of complex sys-

tems in thermal equilibrium at finite temperature and multivariate optimization (Nulton

& Salamon, 1988). In these systems, each configuration is weighted by its Gibbs-

Boltzmann probability factor e−E/T , where E is the energy of the configuration and

T is the temperature. Finding the low-temperature state of a system when the en-

ergy can be computed amounts to solving an optimization problem. This connection

has been used to devise the simulated annealing algorithm (Kirkpatrick et al., 1983).

Similarly for evolutionary algorithms, in the selection process where one would select

“better” configurations, one can use the same technique to weigh the individuals i.e.,

using Gibbs-Boltzmann factor. This is called Boltzmann selection, which is nothing

but defining selection probabilities in the form of Boltzmann canonical distribution.

Classical simulated annealing, as proposed by Kirkpatrick et al. (1983), extended

the well-known procedure of Metropolis et al. (1953) for equilibrium Gibbs-Boltzmann

statistics: a new configuration is accepted with the probability

p = min(1, e−β∆E

), (5.1)

where β = 1T is the inverse temperature parameter and ∆E is the change in the energy.

The annealing consists in decreasing the temperature gradually. Geman and Geman

(1984) showed that if the temperature decreases as the inverse logarithm of time, the

system will end in a global minimum.

On the other hand, in the generalized simulated annealing procedure proposed

by Tsallis and Stariolo (1996) the acceptance probability is generalized to

\[
p = \min \left( 1, \left[ 1 - (1-q)\beta \Delta E \right]^{\frac{1}{1-q}} \right), \tag{5.2}
\]
for some q. The term $\left[ 1 - (1-q)\beta \Delta E \right]^{\frac{1}{1-q}}$ is due to the Tsallis distribution in Tsallis
statistics (see § 3.4), and $q \to 1$ in (5.2) retrieves the acceptance probability in the
classical case. This method is shown to be faster than both classical simulated an-
nealing and the fast simulated annealing methods (Stariolo & Tsallis, 1995; Tsallis,


1988). This algorithm has been used successfully in many applications (Yu & Mo,

2003; Moret et al., 1998; Penna, 1995; Andricioaei & Straub, 1996, 1997).
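For reference, the two acceptance rules can be written compactly as follows (a small illustrative sketch; the values passed in are arbitrary):

    import numpy as np

    def accept_classical(delta_E, beta):
        return min(1.0, np.exp(-beta * delta_E))                        # (5.1)

    def accept_generalized(delta_E, beta, q):
        base = 1.0 - (1.0 - q) * beta * delta_E
        return min(1.0, base**(1.0 / (1.0 - q))) if base > 0 else 0.0   # (5.2)

    print(accept_classical(0.5, beta=2.0))
    print(accept_generalized(0.5, beta=2.0, q=1.5))
    print(np.isclose(accept_generalized(0.5, 2.0, q=1.0 + 1e-9), accept_classical(0.5, 2.0)))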

The above described use of power-law distributions in simulated annealing is the

motivation for us to incorporate the Tsallis canonical probability distribution for selection
in evolutionary algorithms and to test the novelty of the resulting algorithms.

Before we present the proposed algorithm and simulation results, we also present
an information theoretic justification of the Boltzmann distribution in the selection mecha-
nism (Dukkipati et al., 2005a). In fact, in evolutionary algorithms Boltzmann selec-
tion is viewed just as an exponential scaling for proportionate selection (de la Maza &
Tidor, 1993) (where selection probabilities of configurations are inversely proportional
to their energies (Holland, 1975)). We show that by using the Boltzmann distribution in the
selection mechanism one would implicitly satisfy Kullback's minimum relative-entropy
principle.

5.1 EAs based on Boltzmann Distribution

Let $\Omega$ be the search space, i.e., the space of all configurations of an optimization prob-
lem. Let $E : \Omega \to \mathbb{R}^+$ be the objective function – following statistical mechanics
terminology (Nulton & Salamon, 1988; Prugel-Bennett & Shapiro, 1994) we refer
to this function as energy (in evolutionary computation terminology this is called the
fitness function) – where the objective is to find a configuration with lowest energy.
$P_t = \{\omega_k\}_{k=1}^{N_t}$ denotes a population, which is a multi-subset of $\Omega$. Here we assume
that the size of the population at any time is finite and need not be a constant.

In the first step, the initial population $P_0$ is chosen with random configurations. At
each time step t, the population undergoes the following procedure:
\[
P_t \xrightarrow{\ \text{selection}\ } P_t' \xrightarrow{\ \text{variation}\ } P_{t+1}.
\]

Variation is nothing but stochastically perturbing the individuals in the population.

Various methods in evolutionary algorithms follow different approaches. For example

in genetic algorithms, where configurations are represented as binary strings, operators

such as mutation and crossover are used; for details see (Holland, 1975).

Selection is the mechanism where "good" configurations are replicated based on
their selection probabilities (Back, 1994). For a population $P_t = \{\omega_k\}_{k=1}^{N_t}$ with the
corresponding energy values $\{E_k\}_{k=1}^{N_t}$, selection probabilities are defined as
\[
p_t(\omega_k) = \mathrm{Prob}(\omega_k \in P_t' \mid \omega_k \in P_t), \qquad \forall k = 1, \ldots, N_t,
\]


Figure 5.1: Structure of evolutionary algorithms (flowchart: initialize population; evaluate "fitness"; apply selection; randomly vary individuals; repeat until the stop criterion is met)

and $\{p_t(\omega_k)\}_{k=1}^{N_t}$ satisfies the condition $\sum_{k=1}^{N_t} p_t(\omega_k) = 1$.

The general structure of evolutionary algorithms is shown in Figure 5.1; for further

details refer to (Fogel, 1994; Back, Hammel, & Schwefel, 1997).

According to Boltzmann selection, selection probabilities are defined as

\[
p_t(\omega_k) = \frac{e^{-\beta E_k}}{\sum_{j=1}^{N_t} e^{-\beta E_j}}, \tag{5.3}
\]

where β is the inverse temperature at time t. The strength of selection is controlled by

the parameter β. A higher value of β (low temperature) gives a stronger selection, and

a lower value of β gives a weaker selection (Back, 1994).

Boltzmann selection gives faster convergence, but without a good annealing sched-

ule for β, it might lead to premature convergence. This problem is well known from

simulated annealing (Aarts & Korst, 1989), but not very well studied in evolutionary

algorithms. This problem is addressed in (Mahnig & Muhlenbein, 2001; Dukkipati,

Murty, & Bhatnagar, 2004) where annealing schedules for evolutionary algorithms

based on Boltzmann selection have been proposed.

Now, we derive the selection equation, similar to the one derived in (Dukkipati
et al., 2004), which characterizes Boltzmann selection from first principles. Given a
population $P_t = \{\omega_k\}_{k=1}^{N_t}$, the simplest probability distribution on $\Omega$ which represents
$P_t$ is
\[
\xi_t(\omega) = \frac{\nu_t(\omega)}{N_t}, \qquad \forall \omega \in \Omega, \tag{5.4}
\]

where the function $\nu_t : \Omega \to \mathbb{Z}^+ \cup \{0\}$ measures the number of occurrences of each
configuration $\omega \in \Omega$ in the population $P_t$. Formally, $\nu_t$ can be defined as
\[
\nu_t(\omega) = \sum_{k=1}^{N_t} \delta(\omega, \omega_k), \qquad \forall \omega \in \Omega, \tag{5.5}
\]
where $\delta : \Omega \times \Omega \to \{0, 1\}$ is defined as $\delta(\omega_1, \omega_2) = 1$ if $\omega_1 = \omega_2$ and $\delta(\omega_1, \omega_2) = 0$
otherwise.

The mechanism of selection involves assigning selection probabilities to the con-
figurations in $P_t$ as in (5.3) and sampling configurations based on these selection probabilities
to generate the population $P_{t+1}$. That is, the selection probability distribution assigns zero
probability to the configurations which are not present in the population. Now, from
the fact that the population is a multi-subset of $\Omega$, we can write the selection probability distri-
bution with respect to the population $P_t$ as
\[
p(\omega) =
\begin{cases}
\dfrac{\nu_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega \in P_t} \nu_t(\omega)\, e^{-\beta E(\omega)}} & \text{if } \omega \in P_t, \\[2ex]
0 & \text{otherwise}.
\end{cases} \tag{5.6}
\]

One can estimate the frequencies of configurations after the selection, $\nu_{t+1}$, as
\[
\nu_{t+1}(\omega) = \frac{\nu_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega \in P_t} \nu_t(\omega)\, e^{-\beta E(\omega)}}\, N_{t+1}, \tag{5.7}
\]

where $N_{t+1}$ is the population size after the selection. Further, the probability distribu-
tion which represents the population $P_{t+1}$ can be estimated as
\[
\xi_{t+1}(\omega) = \frac{\nu_{t+1}(\omega)}{N_{t+1}}
= \frac{\nu_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega \in P_t} \nu_t(\omega)\, e^{-\beta E(\omega)}}
= \frac{\xi_t(\omega)\, N_t\, e^{-\beta E(\omega)}}{\sum_{\omega \in P_t} \xi_t(\omega)\, N_t\, e^{-\beta E(\omega)}}.
\]

Finally, we can write the selection equation as

\[
\xi_{t+1}(\omega) = \frac{\xi_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega \in P_t} \xi_t(\omega)\, e^{-\beta E(\omega)}}. \tag{5.8}
\]


One can observe that (5.8) resembles the minimum relative-entropy distribution
that we derived in § 4.1.1 (see (4.5)). This motivates one to investigate the possible
connection of Boltzmann selection with Kullback's relative-entropy principle.

Given the distribution ξt, which represents the population Pt, we would like to

estimate the distribution ξt+1 that represents the population Pt+1. In this context one

can view ξt as a prior estimate of ξt+1. The available constraints for ξt+1 are

\[
\sum_{\omega \in \Omega} \xi_{t+1}(\omega) = 1, \tag{5.9a}
\]
\[
\sum_{\omega \in \Omega} \xi_{t+1}(\omega)\, E(\omega) = \langle E \rangle_{t+1}, \tag{5.9b}
\]

where 〈E〉t+1 is the expected value of the function E with respect to ξt+1. At this

stage let us assume that 〈E〉t+1 is a given quantity; this will be explained later.

In this setup, Kullback's minimum relative-entropy principle gives the estimate for
$\xi_{t+1}$. That is, one should choose $\xi_{t+1}$ in such a way that it minimizes the relative-
entropy
\[
I(\xi_{t+1}\|\xi_t) = \sum_{\omega \in \Omega} \xi_{t+1}(\omega) \ln \frac{\xi_{t+1}(\omega)}{\xi_t(\omega)} \tag{5.10}
\]

with respect to the constraints (5.9a) and (5.9b). The corresponding Lagrangian can
be written as
\[
L \equiv -I(\xi_{t+1}\|\xi_t)
- (\lambda - 1) \left( \sum_{\omega \in \Omega} \xi_{t+1}(\omega) - 1 \right)
- \beta \left( \sum_{\omega \in \Omega} E(\omega)\, \xi_{t+1}(\omega) - \langle E \rangle_{t+1} \right),
\]
where $\lambda$ and $\beta$ are Lagrange parameters, and
\[
\frac{\partial L}{\partial \xi_{t+1}(\omega)} = 0 \implies \xi_{t+1}(\omega) = e^{\ln \xi_t(\omega) - \lambda - \beta E(\omega)}.
\]

By (5.9a) we get

\[
\xi_{t+1}(\omega) = \frac{e^{\ln \xi_t(\omega) - \beta E(\omega)}}{\sum_{\omega} e^{\ln \xi_t(\omega) - \beta E(\omega)}}
= \frac{\xi_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega \in P_t} \xi_t(\omega)\, e^{-\beta E(\omega)}}, \tag{5.11}
\]

which is the selection equation (5.8) that we have derived from the Boltzmann selec-

tion mechanism. The Lagrange multiplier β is the inverse temperature parameter in

Boltzmann selection.


The above justification is incomplete without explaining the relevance of the con-

straint (5.9b) in this context. Note that the inverse temperature parameter β in (5.11)

is determined using constraint (5.9b). Thus we have

\[
\frac{\sum_{\omega \in \Omega} E(\omega)\, \xi_t(\omega)\, e^{-\beta E(\omega)}}{\sum_{\omega \in \Omega} \xi_t(\omega)\, e^{-\beta E(\omega)}} = \langle E \rangle_{t+1}. \tag{5.12}
\]

Now it is evident that by specifying β in the annealing schedule of Boltzmann selec-

tion, we predetermine 〈E〉t+1, which is the mean of the function E with respect to the

population Pt+1, according to which the configurations for Pt+1 are sampled.
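This equivalence can be confirmed numerically: the sketch below (an illustrative addition; the energies, the population distribution and the use of SciPy's constrained optimizer are assumptions of this example) computes one Boltzmann selection step by (5.8) and compares it with the distribution obtained by minimizing (5.10) under (5.9a)-(5.9b) directly:

    import numpy as np
    from scipy.optimize import minimize

    E = np.array([3.0, 1.0, 2.0, 0.5])           # energies of the configurations in P_t
    xi_t = np.array([0.25, 0.25, 0.25, 0.25])    # distribution representing P_t
    beta = 1.0

    xi_boltzmann = xi_t * np.exp(-beta * E)
    xi_boltzmann /= xi_boltzmann.sum()           # selection equation (5.8)
    E_target = (xi_boltzmann * E).sum()          # the <E>_{t+1} this beta predetermines, (5.12)

    def kl(xi):                                  # relative-entropy (5.10)
        return (xi * np.log(xi / xi_t)).sum()

    cons = ({'type': 'eq', 'fun': lambda xi: xi.sum() - 1.0},            # (5.9a)
            {'type': 'eq', 'fun': lambda xi: (xi * E).sum() - E_target}) # (5.9b)
    res = minimize(kl, xi_t, bounds=[(1e-9, 1.0)] * len(E), constraints=cons,
                   options={'ftol': 1e-12, 'maxiter': 1000})

    print(np.allclose(res.x, xi_boltzmann, atol=1e-4))   # same distribution either way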

Now with this information theoretic justification of Boltzmann selection we pro-

ceed to its generalization to the Tsallis case.

5.2 EA based on Power-law Distributions

We propose a new selection scheme for evolutionary algorithms based on the Tsallis gen-
eralized canonical distribution, which results from the maximum entropy prescriptions of
Tsallis entropy discussed in § 3.4, as follows. For a population $P_t = \{\omega_k\}_{k=1}^{N_t}$ with
corresponding energies $\{E_k\}_{k=1}^{N_t}$ we define selection probabilities as
\[
p_t(\omega_k) = \frac{\left[ 1 - (1-q)\beta_t E_k \right]^{\frac{1}{1-q}}}{\sum_{j=1}^{N_t} \left[ 1 - (1-q)\beta_t E_j \right]^{\frac{1}{1-q}}}, \qquad \forall k = 1, \ldots, N_t, \tag{5.13}
\]

where $\{\beta_t : t = 1, 2, \ldots\}$ is an annealing schedule. We refer to the selection scheme
based on the Tsallis distribution as Tsallis selection and to the evolutionary algorithm with
Tsallis selection as the generalized evolutionary algorithm.

In this algorithm, we use the Cauchy annealing schedule proposed in (Dukkipati
et al., 2004). This annealing schedule chooses $\beta_t$ as a non-decreasing Cauchy sequence
for faster convergence. One such sequence is
\[
\beta_t = \beta_0 \sum_{i=1}^{t} \frac{1}{i^\alpha}, \qquad t = 1, 2, \ldots, \tag{5.14}
\]

where β0 is any constant and α > 1. The novelty of this annealing schedule has been

demonstrated using simulations in (Dukkipati et al., 2004). Similar to the practice

in generalized simulated annealing (Andricioaei & Straub, 1997), in our algorithm, q

tends towards 1 as temperature decreases during annealing.

The generalized evolutionary algorithm based on Tsallis statistics is given in Fig-

ure 5.2.


Algorithm 1 Generalized Evolutionary Algorithm

    P(0) <- initialize with configurations from the search space randomly
    Initialize beta and q
    for t = 1 to T do
        for all omega in P(t) do                            # Selection
            p(omega) = [1 - (1 - q) * beta * E(omega)]^(1/(1-q)) / Z_q
            copy omega into P'(t) with probability p(omega), with replacement
        end for
        for all omega in P'(t) do                           # Variation
            perform variation with a specified probability
        end for
        Update beta according to the annealing schedule
        Update q according to its schedule
        P(t+1) <- P'(t)
    end for

Figure 5.2: Generalized evolutionary algorithm based on Tsallis statistics to optimize the energy function E(omega).
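A compact runnable sketch of Algorithm 1 is given below. It is an illustration only: the toy objective function, the bit-string encoding, the simple bit-flip mutation used as the variation step, and all parameter values are assumptions of this sketch rather than the experimental setup of § 5.3.

    import numpy as np

    rng = np.random.default_rng(0)

    def energy(bits):                       # toy objective: minimize the number of ones
        return bits.sum(axis=1).astype(float)

    def tsallis_selection_probs(E, beta, q):
        base = np.maximum(1.0 - (1.0 - q) * beta * E, 0.0)   # cut-off (only relevant for q < 1)
        w = base**(1.0 / (1.0 - q)) if q != 1.0 else np.exp(-beta * E)
        return w / w.sum()                  # selection probabilities (5.13)

    def generalized_ea(n_bits=20, pop_size=60, T=100, beta0=1.0, alpha=1.01, q0=1.5, p_mut=0.02):
        pop = rng.integers(0, 2, size=(pop_size, n_bits))
        for t in range(1, T + 1):
            beta = beta0 * np.sum(1.0 / np.arange(1, t + 1)**alpha)   # Cauchy schedule (5.14)
            q = q0 + (1.0 - q0) * t / T                               # q decreases linearly to 1
            E = energy(pop)
            probs = tsallis_selection_probs(E, beta, q)
            idx = rng.choice(pop_size, size=pop_size, p=probs)        # selection, with replacement
            pop = pop[idx]
            flips = rng.random(pop.shape) < p_mut                     # variation: bit-flip mutation
            pop = np.where(flips, 1 - pop, pop)
        return pop, energy(pop).min()

    pop, best = generalized_ea()
    print("best energy found:", best)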

5.3 Simulation Results

We discuss the simulations conducted to study the generalized evolutionary algo-
rithm based on Tsallis statistics proposed in this chapter. We compare the performance
of evolutionary algorithms with three selection mechanisms, viz., proportionate se-
lection (where selection probabilities of configurations are inversely proportional to
their energies (Holland, 1975)), Boltzmann selection, and Tsallis selection.
For comparison purposes we study multi-variate function optimization in the frame-
work of genetic algorithms. Specifically, we use the following benchmark test func-
tions (Muhlenbein & Schlierkamp-Voosen, 1993), where the aim is to find the config-
uration with the lowest functional value (reference implementations are sketched after the list):

• Ackley's function:
\[
E_1(\vec{x}) = -20 \exp\left( -0.2 \sqrt{\tfrac{1}{l} \sum_{i=1}^{l} x_i^2} \right)
- \exp\left( \tfrac{1}{l} \sum_{i=1}^{l} \cos(2\pi x_i) \right) + 20 + e,
\]
where $-30 \leq x_i \leq 30$,

• Rastrigin's function:
\[
E_2(\vec{x}) = lA + \sum_{i=1}^{l} \left( x_i^2 - A \cos(2\pi x_i) \right),
\]
where $A = 10$ and $-5.12 \leq x_i \leq 5.12$,

• Griewangk's function:
\[
E_3(\vec{x}) = \sum_{i=1}^{l} \frac{x_i^2}{4000} - \prod_{i=1}^{l} \cos\left( \frac{x_i}{\sqrt{i}} \right) + 1,
\]
where $-600 \leq x_i \leq 600$.
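The sketch below implements these three functions for an l-dimensional NumPy vector (an illustrative addition; each function attains its minimum value 0 at the origin):

    import numpy as np

    def ackley(x):
        l = x.size
        return (-20.0 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / l))
                - np.exp(np.sum(np.cos(2 * np.pi * x)) / l) + 20.0 + np.e)

    def rastrigin(x, A=10.0):
        return A * x.size + np.sum(x**2 - A * np.cos(2 * np.pi * x))

    def griewangk(x):
        i = np.arange(1, x.size + 1)
        return np.sum(x**2) / 4000.0 - np.prod(np.cos(x / np.sqrt(i))) + 1.0

    x = np.zeros(15)
    print(ackley(x), rastrigin(x), griewangk(x))   # each evaluates to (numerically) zero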

Parameters for the algorithms were set to compare the performance of these algorithms
under identical conditions. Each $x_i$ is encoded with 5 bits and $l = 15$, i.e., the search space is
of size $2^{75}$. The population size is $n = 350$. For all the experiments, the probability of uniform
crossover is 0.8 and the probability of mutation is below 0.1. We limited each algorithm
to 100 iterations and have given plots for the behavior of the process when averaged
over 20 runs.

As we mentioned earlier, for Boltzmann selection we have used the Cauchy an-

nealing schedule (see (5.14)), in which we set β0 = 200 and α = 1.01. For Tsallis

selection too, we have used the same annealing schedule as Boltzmann selection with

identical parameters. In our preliminary simulations, q was kept constant and tested

with various values. Then we adopted a strategy from generalized simulated annealing
where one chooses an initial value $q_0$ and decreases it linearly to the value 1. This
schedule of q gave better performance than keeping it constant. We report results
for various values of $q_0$.

Figure 5.3: Performance of the evolutionary algorithm with Tsallis selection for various values of q0 (3, 2, 1.5, 1.01) on the Ackley test function (best fitness vs. generations)

From various simulations, we observed that when the problem size is small (for

example smaller values of l) all the selection mechanisms perform equally well. Boltz-

mann selection is effective when we increase the problem size. For Tsallis selection,

we performed simulations with various values of q0. Figure 5.3 shows the performance

for Ackley function for q0 = 3, 2, 1.5 and 1.01, respectively, from which one can see


Figure 5.4: Ackley, q0 = 1.5 (comparison of proportionate, Boltzmann and Tsallis selection; best fitness vs. generations)

Figure 5.5: Rastrigin, q0 = 2 (comparison of proportionate, Boltzmann and Tsallis selection; best fitness vs. generations)

Figure 5.6: Griewangk, q0 = 1.01 (comparison of proportionate, Boltzmann and Tsallis selection; best fitness vs. generations)


that the choice of $q_0$ is very important for the evolutionary algorithm with Tsallis se-
lection, and that the appropriate value varies with the problem at hand.

Figures 5.4, 5.5 and 5.6 show the comparisons of evolutionary algorithms based
on Tsallis selection, Boltzmann selection and proportionate selection for
different functions. We have reported only the best behavior over the various values of $q_0$.
From these simulation results, we conclude that the evolutionary algorithm based on
the Tsallis canonical distribution with an appropriate value of $q_0$ outperforms those based on
Boltzmann and proportionate selection.


6 Conclusions

Abstract

In this concluding chapter we summarize the results of the Dissertation, with an emphasis on novelties and new problems suggested by this research.

Information theory based on the Shannon entropy functional has found applications that cut
across a myriad of fields because of its established mathematical significance, i.e., its
beautiful mathematical properties. Shannon (1956) too emphasized that “the hard core

of information theory is, essentially, a branch of mathematics” and “a thorough under-

standing of the mathematical foundation . . . is surely a prerequisite to other applica-

tions.” Given that “the hard core of information theory is a branch of mathematics,”

one could expect formal generalizations of information measures taking place, just as

would be the case for any other mathematical concept.

At the outset of this Dissertation we noted from (Renyi, 1960; Csiszar, 1974) that

generalization of information measures should be indicated by their operational sig-

nificance (pragmatic approach) and by a set of natural postulates characterizing them

(axiomatic approach). In the literature, ranging from mathematics and physics to infor-
mation theory and machine learning, one can find various operational and axiomatic

justifications of the generalized information measures. In this thesis, we investigated

some properties of generalized information measures and their maximum and mini-

mum entropy prescriptions pertaining to their mathematical significance.

6.1 Contributions of the Dissertation

In this section we briefly summarize the contributions of this thesis including some

problems suggested by this work.

Renyi’s recipe for nonextensive information measures

Passing an information measure through Renyi’s formalism – a procedure followed by

Renyi to generalize Shannon entropy – allows one to study the possible generaliza-


tions and characterizations of an information measure in terms of axioms of quasilinear

means. In Chapter 2, we studied this technique for nonextensive entropy and showed

that Tsallis entropy is unique under Renyi’s recipe. Assuming that any putative candi-

date for an entropy should be a mean (Renyi, 1961), and in light of attempts to study

ME-prescriptions of information measures, where constraints are specified using KN-

averages (e.g., Czachor & Naudts, 2002), the results presented in this thesis further the

relation between entropy functionals and generalized means.

Measure-theoretic formulations

In Chapter 3, we extended the discrete case definitions of generalized information

measures to the measure-theoretic case. We showed that as in the case of Kullback-

Leibler relative-entropy, generalized relative-entropies, whether Renyi or Tsallis, in

the discrete case can be naturally extended to measure-theoretic case, in the sense

that measure-theoretic definitions can be derived from limits of sequences of finite

discrete entropies of pmfs which approximate the pdfs involved. We also showed that

ME prescriptions of measure-theoretic Tsallis entropy are consistent with the discrete

case, which is also true for measure-theoretic Shannon-entropy.

GYP-theorem

Gelfand-Yaglom-Perez theorem for KL-entropy not only equips it with a fundamental

definition but also provides a means to compute KL-entropy and study its behavior. We

stated and proved the GYP-theorem for generalized relative entropies of order α > 1

(q > 1 for the Tsallis case) in Chapter 3. However, results for the case $0 < \alpha < 1$ are

yet to be obtained.

q-product representation of Tsallis minimum entropy distribution

Tsallis relative-entropy minimization in both the cases, q-expectations and normalized

q-expectations, has been studied and some significant differences with the classical

case are presented in Chapter 4. We showed that unlike in the classical case, minimiz-

ing Tsallis relative-entropy is not equivalent to maximizing entropy when the prior is a

uniform distribution. Our usage of q-product in the representation of Tsallis minimum

entropy distributions, not only provides it with an elegant representation but also sim-

plifies the calculations in the study of its properties and in deriving the expressions for

minimum relative-entropy and corresponding thermodynamic equations.


The detailed study of Tsallis relative-entropy minimization in the case of nor-

malized q-expected values and the computation of corresponding minimum relative-

entropy distribution (where one has to address the self-referential nature of the prob-

abilities) based on the Tsallis et al. (1998) and Martínez et al. (2000) formalisms for Tsallis
entropy maximization is currently under investigation. Considering the various fields
to which Tsallis generalized statistics has been applied, studies of applications of Tsal-
lis relative-entropy minimization to various inference problems are of particular relevance.

Nonextensive Pythagoras’ theorem

Pythagoras' theorem of relative-entropy plays an important role in geometrical ap-

proaches of statistical estimation theory like information geometry. In Chapter 4 we

proved Pythagoras’ theorem in the nonextensive case i.e., for Tsallis relative-entropy

minimization. In our opinion, this result is yet another remarkable and consistent gen-

eralization shown by the Tsallis formalism.

Use of power-law distributions in EAs

Inspired by the generalization of simulated annealing reported by (Tsallis & Stariolo,

1996), in Chapter 5 we proposed a generalized evolutionary algorithm based on Tsal-

lis statistics. The algorithm uses Tsallis canonical probability distribution instead of

Boltzmann distribution. Since these distributions are maximum entropy distributions,

we presented the information theoretical justification to use Boltzmann selection in

evolutionary algorithms – prior to this, Boltzmann selection was viewed only as a spe-

cial case of proportionate selection with exponential scaling. This should encourage

the use of information theoretic methods in evolutionary computation.

We tested our algorithm on some bench-mark test functions. We found that with

an appropriate choice of nonextensive index (q), evolutionary algorithms based on

Tsallis statistics outperform those based on Gibbs-Boltzmann distribution. We believe

the Tsallis canonical distribution is a powerful technique for selection in evolutionary

algorithms.

6.2 Future Directions

There are two fundamental spaces in machine learning. The first space X consists of

data points and the second space Θ consists of possible learning models. In statistical


learning, $\Theta$ is usually a space of statistical models, $\{p(x; \theta) : \theta \in \Theta\}$ in the generative
case or $\{p(y|x; \theta) : \theta \in \Theta\}$ in the discriminative case. Learning algorithms select a
model $\theta \in \Theta$ based on the training examples $\{x_k\}_{k=1}^{n} \subset X$ or $\{(x_k, y_k)\}_{k=1}^{n} \subset X \times Y$,
depending on whether the generative case or the discriminative case is considered.

Applying differential geometry – a mathematical theory of geometry on smooth,
locally Euclidean spaces – to the space of probability distributions, and so to statistical mod-
els, is a fundamental technique in information geometry. Information does, however,

play two roles in it: Kullback-Leibler relative entropy features as a measure of diver-

gence, and Fisher information takes the role of curvature.

The ME-principle is involved in information geometry for two reasons.
One is Pythagoras' theorem of relative-entropy minimization; the other is the
work of Amari (2001). Amari showed that ME distributions are exactly the ones

with minimal interaction between their variables — these are close to independence.

This result plays an important role in geometric approaches to machine learning.

Now, equipped with the nonextensive Pythagoras’ theorem in the generalized case

of Tsallis, it is interesting to know the resultant geometry when we use generalized
information measures, and the role of the entropic index in that geometry.

Another open problem in generalized information measures is the kind of con-

straints one should use for the ME-prescriptions. At present ME-prescriptions for

Tsallis come in three flavors. These three flavors correspond to the kind of constraints

one would use to derive the canonical distribution. The first is conventional expectation

(Tsallis (1988)), the second is q-expectation values (Curado-Tsallis (1991)), and the third
is normalized q-expectation values (Tsallis-Mendes-Plastino (1998)). The question of
which constraints to use remains open and has so far been addressed only

in the context of thermodynamics.

Boghosian (1996) suggested that the entropy functional and the constraints one

would use should be considered as axioms. By this he suggested that their validity

is to be decided solely by the conclusions to which they lead and ultimately by com-

parison with experiment. A practical study of this question in problems related to estimating
probability distributions by using ME of Tsallis entropy might throw some light on it.

Moving on to another problem, we have noted that Tsallis entropy can be written

as a Kolmogorov-Nagumo function of Renyi entropy. We have also seen that the same

function is KN-equivalent to the function which is used in the generalized averaging of

Hartley information to derive Renyi entropy. This suggests the possibility that gener-

alized averages play a role in describing the operational significance of Tsallis entropy,


an explanation for which still eludes us.

Finally, though the Renyi information measure offers a very natural – and perhaps con-
ceptually the cleanest – setting for the generalization of entropy, and while the generalization
of Tsallis entropy too can be put in a somewhat formal setting with q-generalizations of
functions, we still do not fully understand the complete relevance – operational,
axiomatic, or mathematical – of the entropic indices $\alpha$ in Renyi and q in Tsallis.

This is easily the most challenging problem before us.

6.3 Concluding Thought

Mathematical formalism plays an important role not only in physical theories but also

in theories of information phenomena; some undisputed examples being the Shannon

theory of information and the Kolmogorov theory of complexity. One can make further
advances in these theories by, as Dirac (1939, 1963) suggested for the advancement of

theoretical physics, employing all the resources of pure mathematics in an attempt to

perfect and generalize the existing mathematical formalism.

While operational and axiomatic justifications lay the foundations, the study of

“mathematical significance” of these generalized concepts forms the pillars on which

one can develop the generalized theory. The ultimate fruits of this labour include a
better understanding of phenomena in the context, better solutions for related practical
problems – perhaps, as Wigner (1960) called it, the unreasonable effectiveness of mathemat-
ics – and, finally, its own beauty.


Bibliography

Aarts, E., & Korst, J. (1989). Simulated Annealing and Boltzmann Machines–A

Stochastic Approach to Combinatorial Optimization and Neural Computing.

Wiley, New York.

Abe, S., & Suzuki, N. (2004). Scale-free network of earthquakes. Europhysics Letters,

65(4), 581–586.

Abe, S. (2000). Axioms and uniqueness theorem for Tsallis entropy. Physics Letters

A, 271, 74–79.

Abe, S. (2003). Geometry of escort distributions. Physical Review E, 68, 031101.

Abe, S., & Bagci, G. B. (2005). Necessity of q-expectation value in nonextensive

statistical mechanics. Physical Review E, 71, 016139.

Aczel, J. (1948). On mean values. Bull. Amer. Math. Soc., 54, 392–400.

Aczel, J., & Daroczy, Z. (1975). On Measures of Information and Their Characteri-

zation. Academic Press, New York.

Agmon, N., Alhassid, Y., & Levine, R. D. (1979). An algorithm for finding the distri-

bution of maximal entropy. Journal of Computational Physics, 30, 250–258.

Amari, S. (2001). Information geometry on hierarchy of probability distributions.

IEEE Transactions on Information Theory, 47, 1701–1711.

Amari, S. (1985). Differential-Geometric Methods in Statistics, Vol. 28 of Lecture

Notes in Statistics. Springer-Verlag, Heidelberg.

Amari, S., & Nagaoka, H. (2000). Methods of Information Geometry, Vol. 191 of

Translations of Mathematical Monographs. Oxford University Press, Oxford.

Amblard, P.-O., & Vignat, C. (2005). A note on bounded entropies. arXiv:cond-

mat/0509733.

Andricioaei, I., & Straub, J. E. (1996). Generalized simulated annealing algorithms us-

ing Tsallis statistics: Application to conformational optimization of a tetrapep-

tide. Physical Review E, 53(4), 3055–3058.


Andricioaei, I., & Straub, J. E. (1997). On Monte Carlo and molecular dynamics meth-

ods inspired by Tsallis statistics: Methodology, optimization, and application to

atomic clusters. J. Chem. Phys., 107(21), 9117–9124.

Arimitsu, T., & Arimitsu, N. (2000). Tsallis statistics and fully developed turbulence.

J. Phys. A: Math. Gen., 33(27), L235.

Arimitsu, T., & Arimitsu, N. (2001). Analysis of turbulence by statistics based on

generalized entropies. Physica A, 295, 177–194.

Arndt, C. (2001). Information Measures: Information and its Description in Science

and Engineering. Springer, Berlin.

Ash, R. B. (1965). Information Theory. Interscience, New York.

Athreya, K. B. (1994). Entropy maximization. IMA preprint series 1231, Institute for

Mathematics and its Applications, University of Minnesota, Minneapolis.

Back, T. (1994). Selective pressure in evolutionary algorithms: A characterization of

selection mechanisms. In Proceedings of the First IEEE Conference on Evolu-

tionary Computation, pp. 57–62 Piscataway, NJ. IEEE Press.

Back, T., Hammel, U., & Schwefel, H.-P. (1997). Evolutionary computation: Com-

ments on the history and current state. IEEE Transactions on Evolutionary Com-

putation, 1(1), 3–17.

Barabasi, A.-L., & Albert, R. (1999). Emergence of scaling in random networks.

Science, 286, 509–512.

Barlow, H. (1990). Conditions for versatile learning, Helmholtz’s unconscious infer-

ence and the test of perception. Vision Research, 30, 1561–1572.

Bashkirov, A. G. (2004). Maximum Renyi entropy principle for systems with power-

law hamiltonians. Physical Review Letters, 93, 130601.

Ben-Bassat, M., & Raviv, J. (1978). Renyi’s entropy and the probability of error. IEEE

Transactions on Information Theory, IT-24(3), 324–331.

Ben-Tal, A. (1977). On generalized means and generalized convex functions. Journal

of Optimization: Theory and Application, 21, 1–13.

Bhattacharyya, A. (1943). On a measure on divergence between two statistical popu-

lations defined by their probability distributions. Bull. Calcutta. Math. Soc., 35,

99–109.


Bhattacharyya, A. (1946). On some analogues of the amount of information and their use in statistical estimation. Sankhya, 8, 1–14.

Billingsley, P. (1960). Hausdorff dimension in probability theory. Illinois Journal of Mathematics, 4, 187–209.

Billingsley, P. (1965). Ergodic Theory and Information. John Wiley & Sons, Toronto.

Boghosian, B. M. (1996). Thermodynamic description of the relaxation of two-dimensional turbulence using Tsallis statistics. Physical Review E, 53, 4754.

Borges, E. P. (2004). A possible deformed algebra and calculus inspired in nonextensive thermostatistics. Physica A, 340, 95–101.

Borland, L., Plastino, A. R., & Tsallis, C. (1998). Information gain within nonextensive thermostatistics. Journal of Mathematical Physics, 39(12), 6490–6501.

Bounds, D. G. (1987). New optimization methods from physics and biology. Nature, 329, 215.

Campbell, L. L. (1965). A coding theorem and Renyi's entropy. Information and Control, 8, 423–429.

Campbell, L. L. (1985). The relation between information theory and the differential geometry approach to statistics. Information Sciences, 35(3), 195–210.

Campbell, L. L. (1992). Minimum relative entropy and Hausdorff dimension. Internat. J. Math. & Stat. Sci., 1, 35–46.

Campbell, L. L. (2003). Geometric ideas in minimum cross-entropy. In Karmeshu (Ed.), Entropy Measures, Maximum Entropy Principle and Emerging Applications, pp. 103–114. Springer-Verlag, Berlin Heidelberg.

Caticha, A., & Preuss, R. (2004). Maximum entropy and Bayesian data analysis: Entropic prior distributions. Physical Review E, 70, 046127.

Cencov, N. N. (1982). Statistical Decision Rules and Optimal Inference, Vol. 53 of Translations of Mathematical Monographs. Amer. Math. Soc., Providence, RI.

Cercueil, A., & Francois, O. (2001). Monte Carlo simulation and population-based optimization. In Proceedings of the 2001 Congress on Evolutionary Computation (CEC2001), pp. 191–198. IEEE Press.


Cerf, R. (1996a). The dynamics of mutation-selection algorithms with large population sizes. Ann. Inst. H. Poincare, 32, 455–508.

Cerf, R. (1996b). A new genetic algorithm. Ann. Appl. Probab., 6, 778–817.

Cherney, A. S., & Maslov, V. P. (2004). On minimization and maximization of entropy in various disciplines. Theory of Probability and Its Applications, 48(3), 447–464.

Chew, S. H. (1983). A generalization of the quasilinear mean with applications to the measurement of income inequality and decision theory resolving the Allais paradox. Econometrica, 51(4), 1065–1092.

Costa, J. A., Hero, A. O., & Vignat, C. (2002). A characterization of the multivariate distributions maximizing Renyi entropy. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), p. 263. IEEE Press.

Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.

Cover, T. M., Gacs, P., & Gray, R. M. (1989). Kolmogorov's contributions to information theory and algorithmic complexity. The Annals of Probability, 17(3), 840–865.

Csiszar, I. (1969). On generalized entropy. Studia Sci. Math. Hungar., 4, 401–419.

Csiszar, I. (1974). Information measures: A critical survey. In Information Theory, Statistical Decision Functions and Random Processes, Vol. B, pp. 73–86. Academia Praha, Prague.

Csiszar, I. (1975). I-divergence of probability distributions and minimization problems. Ann. Prob., 3(1), 146–158.

Curado, E. M. F., & Tsallis, C. (1991). Generalized statistical mechanics: Connections with thermodynamics. J. Phys. A: Math. Gen., 24, 69–72.

Czachor, M., & Naudts, J. (2002). Thermostatistics based on Kolmogorov-Nagumo averages: Unifying framework for extensive and nonextensive generalizations. Physics Letters A, 298, 369–374.

Daroczy, Z. (1970). Generalized information functions. Information and Control, 16, 36–51.


Davis, H. (1941). The Theory of Econometrics. Principia Press, Bloomington, IN.

de Finetti, B. (1931). Sul concetto di media. Giornale di Istituto Italiano dei Attuarii, 2, 369–396.

de la Maza, M., & Tidor, B. (1993). An analysis of selection procedures with particular attention paid to proportional and Boltzmann selection. In Forrest, S. (Ed.), Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 124–131, San Mateo, CA. Morgan Kaufmann Publishers.

Dirac, P. A. M. (1939). The relation between mathematics and physics. Proceedings of the Royal Society of Edinburgh, 59, 122–129.

Dirac, P. A. M. (1963). The evolution of the physicist's picture of nature. Scientific American, 208, 45–53.

Dobrushin, R. L. (1959). General formulations of Shannon's basic theorems of the theory of information. Usp. Mat. Nauk., 14(6), 3–104.

dos Santos, R. J. V. (1997). Generalization of Shannon's theorem for Tsallis entropy. Journal of Mathematical Physics, 38, 4104–4107.

Dukkipati, A., Bhatnagar, S., & Murty, M. N. (2006a). Gelfand-Yaglom-Perez theorem for generalized relative entropies. arXiv:math-ph/0601035.

Dukkipati, A., Bhatnagar, S., & Murty, M. N. (2006b). On measure theoretic definitions of generalized information measures and maximum entropy prescriptions. arXiv:cs.IT/0601080. (Submitted to Physica A).

Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2004). Cauchy annealing schedule: An annealing schedule for Boltzmann selection scheme in evolutionary algorithms. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Vol. 1, pp. 55–62. IEEE Press.

Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2005a). Information theoretic justification of Boltzmann selection and its generalization to Tsallis case. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC), Vol. 2, pp. 1667–1674. IEEE Press.

Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2005b). Properties of Kullback-Leibler cross-entropy minimization in nonextensive framework. In Proceedings of the IEEE International Symposium on Information Theory (ISIT), pp. 2374–2378. IEEE Press.


Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2006a). Nonextensive triangle equality and other properties of Tsallis relative-entropy minimization. Physica A, 361, 124–138.

Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2006b). Uniqueness of nonextensive entropy under Renyi's recipe. arXiv:cs.IT/05511078.

Ebanks, B., Sahoo, P., & Sander, W. (1998). Characterizations of Information Measures. World Scientific, Singapore.

Eggleston, H. G. (1952). Sets of fractional dimension which occur in some problems of number theory. Proc. London Math. Soc., 54(2), 42–93.

Elsasser, W. M. (1937). On quantum measurements and the role of the uncertainty relations in statistical mechanics. Physical Review, 52, 987–999.

Epstein, L. G., & Zin, S. E. (1989). Substitution, risk aversion and the temporal behavior of consumption and asset returns: A theoretical framework. Econometrica, 57, 937–970.

Faddeev, D. K. (1986). On the concept of entropy of a finite probabilistic scheme (Russian). Uspehi Mat. Nauk (N.S.), 11, 227–231.

Ferri, G. L., Martínez, S., & Plastino, A. (2005). The role of constraints in Tsallis' nonextensive treatment revisited. Physica A, 347, 205–220.

Fishburn, P. C. (1986). Implicit mean value and certainty equivalence. Econometrica, 54(5), 1197–1206.

Fogel, D. B. (1994). An introduction to simulated evolutionary optimization. IEEE Transactions on Neural Networks, 5(1), 3–14.

Forte, B., & Ng, C. T. (1973). On a characterization of the entropies of type β. Utilitas Math., 4, 193–205.

Furuichi, S. (2005). On uniqueness theorem for Tsallis entropy and Tsallis relative entropy. IEEE Transactions on Information Theory, 51(10), 3638–3645.

Furuichi, S. (2006). Information theoretical properties of Tsallis entropies. Journal of Mathematical Physics, 47, 023302.

Furuichi, S., Yanagi, K., & Kuriyama, K. (2004). Fundamental properties of Tsallis relative entropy. Journal of Mathematical Physics, 45, 4868–4877.


Gelfand, I. M., Kolmogorov, A. N., & Yaglom, A. M. (1956). On the general definition of the amount of information. Dokl. Akad. Nauk USSR, 111(4), 745–748. (In Russian).

Gelfand, I. M., & Yaglom, A. M. (1959). Calculation of the amount of information about a random function contained in another such function. Usp. Mat. Nauk, 12(1), 3–52. (English translation in American Mathematical Society Translations, Providence, R.I., Series 2, vol. 12).

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6(6), 721–741.

Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 158–171.

Good, I. J. (1963). Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Ann. Math. Statist., 34, 911–934.

Gray, R. M. (1990). Entropy and Information Theory. Springer-Verlag, New York.

Grendar Jr., M., & Grendar, M. (2001). Maximum entropy: Clearing up mysteries. Entropy, 3(2), 58–63.

Guiasu, S. (1977). Information Theory with Applications. McGraw-Hill, Great Britain.

Halsey, T. C., Jensen, M. H., Kadanoff, L. P., Procaccia, I., & Shraiman, B. I. (1986). Fractal measures and their singularities: The characterization of strange sets. Physical Review A, 33, 1141–1151.

Hardy, G. H., Littlewood, J. E., & Polya, G. (1934). Inequalities. Cambridge University Press, Cambridge.

Harremoes, P., & Topsøe, F. (2001). Maximum entropy fundamentals. Entropy, 3, 191–226.

Hartley, R. V. L. (1928). Transmission of information. Bell System Technical Journal, 7, 535.

Havrda, J., & Charvat, F. (1967). Quantification method of classification process: Concept of structural α-entropy. Kybernetika, 3, 30–35.

Hincin, A. (1953). The concept of entropy in the theory of probability (Russian). Uspehi Mat. Nauk, 8(3), 3–28. (English transl.: In Mathematical Foundations of Information Theory, pp. 1–28. Dover, New York, 1957).


Hobson, A. (1969). A new theorem of information theory. J. Stat. Phys., 1, 383–391.

Hobson, A. (1971). Concepts in Statistical Mechanics. Gordon and Breach, New York.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI.

Ireland, C., & Kullback, S. (1968). Contingency tables with given marginals. Biometrika, 55, 179–188.

Jaynes, E. T. (1957a). Information theory and statistical mechanics I. Physical Review, 106(4), 620–630.

Jaynes, E. T. (1957b). Information theory and statistical mechanics II. Physical Review, 108(4), 171–190.

Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, SSC-4(3), 227–241.

Jeffreys, H. (1948). Theory of Probability (2nd Edition). Clarendon Press, Oxford.

Jizba, P., & Arimitsu, T. (2004a). Observability of Renyi's entropy. Physical Review E, 69, 026128.

Jizba, P., & Arimitsu, T. (2004b). The world according to Renyi: thermodynamics of fractal systems. Annals of Physics, 312, 17–59.

Johnson, O., & Vignat, C. (2005). Some results concerning maximum Renyi entropy distributions. arXiv:math.PR/0507400.

Johnson, R., & Shore, J. (1983). Comments on and correction to 'Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy'. IEEE Transactions on Information Theory, 29(6), 942–943.

Kallianpur, G. (1960). On the amount of information contained in a σ-field. In Olkin, I., & Ghurye, S. G. (Eds.), Essays in Honor of Harold Hotelling, pp. 265–273. Stanford Univ. Press, Stanford.

Kamimura, R. (1998). Minimizing α-information for generalization and interpretation. Algorithmica, 22(1/2), 173–197.

Kantorovitz, S. (2003). Introduction to Modern Analysis. Oxford University Press, New York.


Kapur, J. N. (1994). Measures of Information and their Applications. Wiley, New York.

Kapur, J. N., & Kesavan, H. K. (1997). Entropy Optimization Principles with Applications. Academic Press.

Karmeshu, & Sharma, S. (2006). Queue length distribution of network packet traffic: Tsallis entropy maximization with fractional moments. IEEE Communications Letters, 10(1), 34–36.

Khinchin, A. I. (1956). Mathematical Foundations of Information Theory. Dover, New York.

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671–680.

Kolmogorov, A. N. (1930). Sur la notion de la moyenne. Atti della R. Accademia Nazionale dei Lincei, 12, 388–391.

Kolmogorov, A. N. (1957). Theorie der Nachrichtenübermittlung. In Grell, H. (Ed.), Arbeiten zur Informationstheorie, Vol. 1. Deutscher Verlag der Wissenschaften, Berlin.

Kotz, S. (1966). Recent results in information theory. Journal of Applied Probability, 3(1), 1–93.

Kreps, D. M., & Porteus, E. L. (1978). Temporal resolution of uncertainty and dynamic choice theory. Econometrica, 46, 185–200.

Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.

Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Stat., 22, 79–86.

Lavenda, B. H. (1998). The analogy between coding theory and multifractals. Journal of Physics A: Math. Gen., 31, 5651–5660.

Lazo, A. C. G. V., & Rathie, P. N. (1978). On the entropy of continuous probability distributions. IEEE Transactions on Information Theory, IT-24(1), 120–122.

Maassen, H., & Uffink, J. B. M. (1988). Generalized entropic uncertainty relations. Physical Review Letters, 60, 1103–1106.


Mahnig, T., & Muhlenbein, H. (2001). A new adaptive Boltzmann selection schedule SDS. In Proceedings of the Congress on Evolutionary Computation (CEC'2001), pp. 183–190. IEEE Press.

Markel, J. D., & Gray, A. H. (1976). Linear Prediction of Speech. Springer-Verlag, New York.

Martínez, S., Nicolas, F., Pennini, F., & Plastino, A. (2000). Tsallis' entropy maximization procedure revisited. Physica A, 286, 489–502.

Masani, P. R. (1992a). The measure-theoretic aspects of entropy, Part 1. Journal of Computational and Applied Mathematics, 40, 215–232.

Masani, P. R. (1992b). The measure-theoretic aspects of entropy, Part 2. Journal of Computational and Applied Mathematics, 44, 245–260.

Mead, L. R., & Papanicolaou, N. (1984). Maximum entropy in the problem of moments. Journal of Mathematical Physics, 25(8), 2404–2417.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.

Morales, D., Pardo, L., Pardo, M. C., & Vajda, I. (2004). Renyi statistics for testing composite hypotheses in general exponential models. Journal of Theoretical and Applied Statistics, 38(2), 133–147.

Moret, M. A., Pascutti, P. G., Bisch, P. M., & Mundim, K. C. (1998). Stochastic molecular optimization using generalized simulated annealing. J. Comp. Chemistry, 19, 647.

Muhlenbein, H., & Schlierkamp-Voosen, D. (1993). Predictive models for the breeder genetic algorithm. Evolutionary Computation, 1(1), 25–49.

Nagumo, M. (1930). Über eine Klasse der Mittelwerte. Japanese Journal of Mathematics, 7, 71–79.

Naranan, S. (1970). Bradford's law of bibliography of science: an interpretation. Nature, 227, 631.

Nivanen, L., Mehaute, A. L., & Wang, Q. A. (2003). Generalized algebra within a nonextensive statistics. Rep. Math. Phys., 52, 437–444.


Norries, N. (1976). General means and statistical theory. The American Statistician, 30, 1–12.

Nulton, J. D., & Salamon, P. (1988). Statistical mechanics of combinatorial optimization. Physical Review A, 37(4), 1351–1356.

Ochs, W. (1976). Basic properties of the generalized Boltzmann-Gibbs-Shannon entropy. Reports on Mathematical Physics, 9, 135–155.

Ormoneit, O., & White, H. (1999). An efficient algorithm to compute maximum entropy densities. Econometric Reviews, 18(2), 127–140.

Ostasiewicz, S., & Ostasiewicz, W. (2000). Means and their applications. Annals of Operations Research, 97, 337–355.

Penna, T. J. P. (1995). Traveling salesman problem and Tsallis statistics. Physical Review E, 51, R1.

Perez, A. (1959). Information theory with abstract alphabets. Theory of Probability and its Applications, 4(1).

Pinsker, M. S. (1960a). Dynamical systems with completely positive or zero entropy. Soviet Math. Dokl., 1, 937.

Pinsker, M. S. (1960b). Information and Information Stability of Random Variables and Processes. Holden-Day, San Francisco, CA. (English ed., 1964, translated and edited by Amiel Feinstein).

Prugel-Bennett, A., & Shapiro, J. (1994). Analysis of genetic algorithms using statistical mechanics. Physical Review Letters, 72(9), 1305–1309.

Queiros, S. M. D., Anteneodo, C., & Tsallis, C. (2005). Power-law distributions in economics: a nonextensive statistical approach. In Abbott, D., Bouchaud, J.-P., Gabaix, X., & McCauley, J. L. (Eds.), Noise and Fluctuations in Econophysics and Finance, pp. 151–164. SPIE, Bellingham, WA.

Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37, 81–91.

Rebollo-Neira, L. (2001). Nonextensive maximum-entropy-based formalism for data subset selection. Physical Review E, 65, 011113.


Renyi, A. (1959). On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hung., 10, 193–215. (reprinted in (Turan, 1976), pp. 320–342).

Renyi, A. (1960). Some fundamental questions of information theory. MTA III. Oszt. Kozl., 10, 251–282. (reprinted in (Turan, 1976), pp. 526–552).

Renyi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pp. 547–561, Berkeley-Los Angeles. University of California Press. (reprinted in (Turan, 1976), pp. 565–580).

Renyi, A. (1965). On the foundations of information theory. Rev. Inst. Internat. Stat., 33, 1–14. (reprinted in (Turan, 1976), pp. 304–317).

Renyi, A. (1970). Probability Theory. North-Holland, Amsterdam.

Rosenblatt-Roth, M. (1964). The concept of entropy in probability theory and its applications in the theory of information transmission through communication channels. Theory Probab. Appl., 9(2), 212–235.

Rudin, W. (1964). Real and Complex Analysis. McGraw-Hill. (International edition, 1987).

Ryu, H. K. (1993). Maximum entropy estimation of density and regression functions. Journal of Econometrics, 56, 397–440.

Sanov, I. N. (1957). On the probability of large deviations of random variables. Mat. Sbornik, 42, 11–44. (in Russian).

Schutzenberger, M. B. (1954). Contribution aux applications statistiques de la theorie de l'information. Publ. l'Institut Statist. de l'Universite de Paris, 3, 3–117.

Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379.

Shannon, C. E. (1956). The bandwagon (edtl.). IEEE Transactions on Information Theory, 2, 3–3.

Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Urbana, Illinois.

Shore, J. E., & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, IT-26(1), 26–37. (See (Johnson & Shore, 1983) for comments and corrections.)

Shore, J. E. (1981a). Minimum cross-entropy spectral analysis. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-29, 230–237.

Shore, J. E. (1981b). Properties of cross-entropy minimization. IEEE Transactions on Information Theory, IT-27(4), 472–482.

Shore, J. E., & Gray, R. M. (1982). Minimum cross-entropy pattern classification and cluster analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 4(1), 11–18.

Skilling, J. (1984). The maximum entropy method. Nature, 309, 748.

Smith, J. D. H. (2001). Some observations on the concepts of information theoretic entropy and randomness. Entropy, 3, 1–11.

Stariolo, D. A., & Tsallis, C. (1995). Optimization by simulated annealing: Recent progress. In Stauffer, D. (Ed.), Annual Reviews of Computational Physics, Vol. 2, p. 343. World Scientific, Singapore.

Sutton, P., Hunter, D. L., & Jan, N. (1994). The ground state energy of the ±J spin glass from the genetic algorithm. Journal de Physique I France, 4, 1281–1285.

Suyari, H. (2002). Nonextensive entropies derived from invariance of pseudoadditivity. Physical Review E, 65, 066118.

Suyari, H. (2004a). Generalization of Shannon-Khinchin axioms to nonextensive systems and the uniqueness theorem for the nonextensive entropy. IEEE Transactions on Information Theory, 50(8), 1783–1787.

Suyari, H. (2004b). q-Stirling's formula in Tsallis statistics. arXiv:cond-mat/0401541.

Suyari, H., & Tsukada, M. (2005). Law of error in Tsallis statistics. IEEE Transactions on Information Theory, 51(2), 753–757.

Teweldeberhan, A. M., Plastino, A. R., & Miller, H. G. (2005). On the cut-off prescriptions associated with power-law generalized thermostatistics. Physics Letters A, 343, 71–78.

Tikochinsky, Y., Tishby, N. Z., & Levine, R. D. (1984). Consistent inference of probabilities for reproducible experiments. Physical Review Letters, 52, 1357–1360.


Topsøe, F. (2001). Basic concepts, identities and inequalities - the toolkit of information theory. Entropy, 3, 162–190.

Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52, 479.

Tsallis, C. (1994). What are the numbers that experiments provide? Quimica Nova, 17, 468.

Tsallis, C., & de Albuquerque, M. P. (2000). Are citations of scientific papers a case of nonextensivity? Eur. Phys. J. B, 13, 777–780.

Tsallis, C. (1998). Generalized entropy-based criterion for consistent testing. Physical Review E, 58, 1442–1445.

Tsallis, C. (1999). Nonextensive statistics: Theoretical, experimental and computational evidences and connections. Brazilian Journal of Physics, 29, 1.

Tsallis, C., Levy, S. V. F., Souza, A. M. C., & Maynard, R. (1995). Statistical-mechanical foundation of the ubiquity of Levy distributions in nature. Physical Review Letters, 75, 3589–3593.

Tsallis, C., Mendes, R. S., & Plastino, A. R. (1998). The role of constraints within generalized nonextensive statistics. Physica A, 261, 534–554.

Tsallis, C., & Stariolo, D. A. (1996). Generalized simulated annealing. Physica A, 233, 345–406.

Turan, P. (Ed.). (1976). Selected Papers of Alfred Renyi. Akademiai Kiado, Budapest.

Uffink, J. (1995). Can the maximum entropy principle be explained as a consistency requirement? Studies in History and Philosophy of Modern Physics, 26, 223–261.

Uffink, J. (1996). The constraint rule of the maximum entropy principle. Studies in History and Philosophy of Modern Physics, 27, 47–79.

Vignat, C., Hero, A. O., & Costa, J. A. (2004). About closedness by convolution of the Tsallis maximizers. Physica A, 340, 147–152.

Wada, T., & Scarfone, A. M. (2005). Connections between Tsallis' formalism employing the standard linear average energy and ones employing the normalized q-average energy. Physics Letters A, 335, 351–362.


Watanabe, S. (1969). Knowing and Guessing. Wiley.

Wehrl, A. (1991). The many facets of entropy. Reports on Mathematical Physics, 30, 119–129.

Wiener, N. (1948). Cybernetics. Wiley, New York.

Wigner, E. P. (1960). The unreasonable effectiveness of mathematics in the natural sciences. Communications on Pure and Applied Mathematics, 13, 1–14.

Wu, X. (2003). Calculation of maximum entropy densities with application to income distribution. Journal of Econometrics, 115, 347–354.

Yamano, T. (2001). Information theory based on nonadditive information content. Physical Review E, 63, 046105.

Yamano, T. (2002). Some properties of q-logarithm and q-exponential functions in Tsallis statistics. Physica A, 305, 486–496.

Yu, Z. X., & Mo, D. (2003). Generalized simulated annealing algorithm applied in the ellipsometric inversion problem. Thin Solid Films, 425, 108.

Zellner, A., & Highfield, R. A. (1988). Calculation of maximum entropy distributions and approximation of marginal posterior distributions. Journal of Econometrics, 37, 195–209.

Zitnick, C. (2003). Computing Conditional Probabilities in Large Domains by Maximizing Renyi's Quadratic Entropy. Ph.D. thesis, Robotics Institute, Carnegie Mellon University.
