
ALGORITHMIC INFORMATION THEORY: REVIEW FOR PHYSICISTS AND NATURAL SCIENTISTS

S D DEVINE

Victoria Management School, Victoria University of Wellington, PO Box 600, Wellington, 6140, New Zealand,

[email protected]

April 21, 2017

Preface

I was originally a research physicist in a government laboratory, with a modest publication record. On the way to becoming a manager in the New Zealand science system I became an economist inquiring into wealth creation through technological change. In doing this, I saw serious difficulties with the neoclassical equilibrium view of economics, as the view failed to recognise that an economy is far from equilibrium. In an equilibrium situation the development path is irrelevant. However, the different paths and outcomes that occurred when the Soviet Union and China moved towards market economies show that the path is critical. Again, the 2007-2008 shock to the US and the world economy shows that the macroeconomic drivers have little to do with a neoclassical equilibrium view.

In looking for a way to approach non-equilibrium issues, I stumbled across Chaitin's work on Algorithmic Information Theory (AIT), outlined in John Casti's book "Complexification". This introduced me to the works of Gregory Chaitin and ultimately to Li and Vitányi's [79] comprehensive book. While this book is a treasure trove, it is difficult for an outsider to master, as the context of much of the development is not clear to a non-mathematician. Nevertheless the book pointed to Kolmogorov's work on algorithmic complexity, probability theory and randomness. So while the application to economic systems is on the back burner, I realised that AIT provided a useful tool for scientists to look at natural systems.

In short, the developments in AIT, probability and randomness seemed to me to provide a valuable tool to apply to the kind of organised complex systems studied in the natural world. Unfortunately, these insights are not readily accessible to the scientific community, or in a form that is easy to apply. I found that the earlier work of Zurek and Bennett on algorithmic entropy provided rich understandings of non-equilibrium systems, yet this work seems to have become lost in the mists of time. Hence this book. The hope is that readers will be able to absorb the key ideas behind AIT so that they are in a better position to access the mathematical developments and to apply the ideas to their own areas of interest.

Sean Devine

Contents

Contents 2

1 Introduction 5
1.1 Some approaches to complexity 6
1.2 Algorithmic information theory (AIT) 7
Algorithmic Information Theory and Entropy 8
Problems with AIT 9
1.3 Algorithmic Information Theory and mathematics 11
1.4 Real world systems 11

2 Computation and Algorithmic Information Theory 15
2.1 The computational requirements for workable algorithms 15
2.2 The Turing machine 17
The Universal Turing Machine 18
Alternative Turing Machines 19
Noncomputable Functions 20
2.3 Measure Theory 21
Cylindrical sets 22

3 AIT and algorithmic complexity 25
3.1 Machine dependence and the Invariance Theorem 27
Issues with AIT 30
3.2 Self-delimiting Coding and the Kraft inequality 31
3.3 Optimum coding and Shannon's noiseless coding theorem 34
Delimiting coding of a natural number 35
The relationship with entropy, conditional entropy etc. 37
3.4 Entropy relative to the common framework 39
3.5 Entropy and probability 41
3.6 The fountain of all knowledge: Chaitin's Omega 42
3.7 Gödel's theorem and Formal Axiomatic Systems 43
3.8 The algorithmic coding theorem, universal semi-measure, priors and inference 46
Inference and Bayesian statistics and the universal distribution 48

4 The algorithmic entropy of strings exhibiting order but with variation or noise 53
4.1 Provisional entropy 54
4.2 The Algorithmic Minimum Sufficient Statistic approach to the provisional entropy 56
4.3 Practicalities of identifying the shortest AMSS description using the AMSS or the provisional entropy 58
A simple example 60
4.4 The provisional entropy and the actual algorithmic entropy 61
4.5 The specification in terms of a probability distribution 63
How to specify noisy data 65
Model examples 66
Model selection 68

5 Order and entropy 71
5.1 The meaning of order 71
5.2 Algorithmic Entropy and conventional Entropy 73
Missing Information and the algorithmic description 78
Comments on the Algorithmic Entropy, the Shannon Entropy and the thermodynamic entropy 80
The provisional entropy as a snapshot of the system at an instance, and the entropy evaluated from net bit flows into the system 81

6 Reversibility and computations 87
Implications of an irreversible computation that simulates a reversible real world computation 88
6.1 Implications for Algorithmic entropy and the second law 90
6.2 The cost of restoring a system to a previous state 94
6.3 The algorithmic equivalent of the second law of thermodynamics 96

7 The minimum description length approach 109

8 The non-typical string and randomness 117
8.1 Outline on perspectives on randomness 117
8.2 Martin-Löf test of randomness 121

9 How replication processes maintain a system far from equilibrium 129
9.1 Introduction to maintaining a system distant from equilibrium 129
Brief summary of earlier exposition of AIT 129
9.2 Examples of the computational issues around the degradation of simple natural systems 131
System processes and system degradation 133
9.3 Entropy balances in a reversible system 134
9.4 Homeostasis and second law evolution 135
Information requirements of homeostasis 135
9.5 Replication processes to counter the second law of thermodynamics by generating order 140
Replication as a process that creates ordered structures 141
9.6 Simple illustrative examples of replication 143
Replicating spins 143
Coherent photons as replicators 146
9.7 The entropy cost of replication with variations 148
Natural ordering through replication processes 149
Replicating algorithms 152
9.8 Selection processes to maintain a system in a viable configuration 156
Adaptation and interdependence of replicas in an open system 160
9.9 Summary of system regulation and AIT 166

10 AIT and Philosophical issues 169
10.1 Algorithmic Descriptions, Learning and Artificial Intelligence 169
10.2 The mathematical implications of Algorithmic Information Theory 170
10.3 How can we understand the universe? 171

Appendix A Exact cost of disturbance 177

Bibliography 179

Abstract


Chapter 1

Introduction

In recent years there has been increased interest in complexity and organised complex systems, whether these be physical, biological or societal systems (see for example Casti [21]). However, there is not always a consistent understanding of the meaning of the word complexity. Those approaching complexity from a mathematical viewpoint identify complexity with randomness - i.e. the most complex structures are those embodying the least redundancy of information, or the greatest entropy. Others, from a physical, biological, or social sciences perspective, use the phrase "complex system" to identify systems that show order or structure, but cannot be simply described. These systems do not have a relatively high entropy and, because they are far from equilibrium, they are anything but random. As this book is written from the perspective of the natural sciences, the phrase "organised systems", or "organised complex system", will generally be used to apply to ordered systems that are non-random and are far from equilibrium; i.e. those that natural scientists might call complex. However, as at times these organised systems will also be described as "complex", exhibiting "complexity" or "complex behaviour", a distinction needs to be made with the situation where "complex" refers to randomness in the mathematical sense. Hopefully the context will make the meaning clear so that no ambiguity arises.

A critical question is whether one should inquire into a complex organised system from the "bottom up" so to speak, where understandings at the most fundamental level arise. However, while a bottom up understanding may provide insights into how organised systems can be constructed from simpler systems, at some level of complexity the bottom up process of sense making becomes too involved - there is just too much detail. If deeper understandings are to emerge, high level approaches are needed to inquire into the system from the "top down". The bottom up approach uses the simplest pattern-making structures, such as a cellular automaton (see Wolfram [112]) or a Turing machine, to see if understandings at the fundamental level offer insights into high level complex or ordered systems. Moving up to a higher level of detail, a few steps from the bottom, these pattern making structures may be interpreted as adaptive agents responding to an external environment (e.g. Crutchfield [38]). In practice, the observed complex behaviour at the higher level may be said to be emergent because it cannot be understood in simple terms. For example, one can discuss the adaptive responses of a life form in a biosystem in these higher level terms. Understandings at these levels are primarily phenomenological, often seeing the whole as more than "the sum of the parts", as this is the best that can be done with the way humans make sense with limited information.

However the argument here is that Algorithmic Information Theory can suggest ways to "sum the parts" in order to provide insights into the principles behind the phenomenological approach. The approach of Algorithmic Information Theory (AIT) (see for example Li and Vitányi [79] and Chaitin [31]) provides a thread that can link fundamental descriptions of simple systems to understandings of the behaviour of emerging systems and, furthermore, is helpful in providing insights into underlying processes implicit in the more phenomenological descriptions. Where the high level description makes reference to underlying structures, e.g. where nesting occurs, as it does for Stafford Beer's Viable Systems Model [7, 8], Algorithmic Information Theory can provide a bridge between a top down phenomenological approach to structure, and a bottom up, atomistic or microsystem's, approach.

1.1 Some approaches to complexity

Crutchfield and his colleagues have developed computational mechanics ([40, 38]) using the concepts of 'causal states' and statistical complexity to characterise pattern, structure and organisation in discrete-valued, discrete-time stationary stochastic processes. The computational mechanics approach recognises that, where a noisy pattern in a string of characters is observed as a time or spatial sequence, there is the possibility that future members of the sequence can be predicted. The approach [39, 95] shows that one does not need to know the whole past to optimally predict the future of an observed stochastic sequence, but only what causal state the sequence comes from. All members of the set of pasts that generate the same stochastic future belong to the same causal state. Measures associated with the causal state approach include the statistical complexity, which is the Shannon entropy of the set of causal states, the excess entropy E and the entropy rate h - the entropy per symbol. The approach has been extended to spatial sequences using the example of one dimensional spin chains [53].

Wolfram [112] has shown that simple cellular automata can produce rich patterns. He suggests that the universe itself is explainable in terms of simple computational systems. Bennett [11] has suggested that "logical depth", a measure of how many steps are required for a computational system to produce the desired pattern or structure, is a measure of the system's complexity.

Each of these bottom up approaches offers insights into organised systems. However, the purpose of this book is to focus on how Algorithmic Information Theory (AIT), the measure of the complexity of a structure or object by its shortest description, can provide a comprehensive understanding of organised systems. The development of an AIT approach that allows for noise or variation in pattern (Devine [45] and Vereshchagin and Vitányi [108]) is not inconsistent with the causal state approach of Crutchfield and his colleagues. Wolfram's approach (Wolfram [112], last chapter) also can be interpreted in an AIT framework, while Bennett's logical depth is a measure of the time or number of steps required to complete a computational process. The following section outlines the key points of AIT.

1.2 Algorithmic information theory (AIT)

AIT has emerged to offer insights into several different mathematical and physical problems. Solomonoff [97] was interested in inductive reasoning and artificial intelligence. Solomonoff argued that to send the result of a series of random events (such as the outcome of a series of football results) every result would need to be transmitted as a string of numbers. On the other hand, the first 100 digits of π could be transmitted more efficiently by transmitting the algorithm that generates 100 digits of π. The Soviet mathematician Kolmogorov [72] developed the theory as part of his approach to probability, randomness and mutual information. Gregory Chaitin [24] independently developed the theory to describe pattern in strings and then moved on to inquire into incompleteness in formal systems and randomness in mathematics. Significant developments took place within the Kolmogorov school, but these did not become readily available to those in the West until researchers from the school moved to Europe and the US.

While the next chapter will describe the theory in more depth, the core of AIT is that, where a structure can be represented by a string of symbols, the shortest algorithm that is able to generate the string is a measure of what mathematicians call the "complexity" of the string and therefore its structure. (Here, to avoid ambiguity, the phrase "algorithmic complexity" will be used to denote this concept.) As the length of any algorithm will depend on the computational system used, such a measure would appear to be of little use. Fortunately, the length of the set of instructions that generates the string is not particularly computer dependent (see section 3.1). As a consequence, a structure represented by a patterned string can be described by an algorithm much shorter than the string itself. On the other hand, a string that is a random sequence of symbols represents a system that has no discernible organisation. Such a string can only be described by an algorithm longer than the length of the string. For example, a string made up of a thousand repeats of '01' has two thousand characters and is of the form s = 010101...0101. This string can be described by the algorithm

PRINT 01, 1000 times.


The algorithm is relatively short. Its size will be dominated by the code for '1000' which, for a binary algorithm, is close to log2 1000. On the other hand, a random binary string of the same length, s = 10011...0111, can only be produced by specifying each symbol; i.e. PRINT 10011...0111. The instruction set will be at least 2,000 characters long, plus some extra instructions to specify the print statement.
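
To make the size comparison concrete, the following short Python sketch (an illustration of the idea, not taken from the text) compares the length of a program that generates the patterned string with the length of the string itself, and with the roughly log2 1000 bits needed for the repeat count.

import math

s_patterned = "01" * 1000            # the two-thousand character patterned string
program = "print('01' * 1000)"       # a short program that generates it

print(len(s_patterned))              # 2000 characters
print(len(program))                  # about 20 characters, dominated by the literal 1000
print(math.log2(1000))               # roughly 10 bits to encode the repeat count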

In order for a code to be uniquely decoded, either an endmarker is required, or the codes are constructed in a way so that no code is a prefix of any other code. The latter codes, which in effect can be read instantaneously, come from a prefix-free set. Because they contain information about their length, they are called "self-delimiting" codes (or, for some strange reason, "prefix" codes). Throughout this work, code words from a "prefix-free set", or code words that are "self-delimiting", will be used to describe these codes. As is discussed later, algorithms that do not need endmarkers have properties that align with the Shannon entropy measure as they obey the Kraft inequality. If an algorithm without endmarkers is to halt in a unique state, no algorithm can be a prefix of another, otherwise it would halt too early. In this case, the algorithm itself must also come from a prefix-free set and be self-delimiting.
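
As a small illustration (the code and the example codewords are mine, not the author's), the following Python sketch checks that a prefix-free binary code satisfies the Kraft inequality, i.e. that the codeword lengths l_i obey the sum over i of 2^(-l_i) <= 1.

codewords = ["0", "10", "110", "111"]    # a hypothetical prefix-free binary code

def is_prefix_free(words):
    # no codeword may be a proper prefix of another
    return not any(a != b and b.startswith(a) for a in words for b in words)

kraft_sum = sum(2.0 ** -len(w) for w in codewords)
print(is_prefix_free(codewords))          # True
print(kraft_sum, kraft_sum <= 1.0)        # 1.0 True: the Kraft inequality holds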

As is shown later (section 3.2), the extension of the Kraft inequality to an infinite set of programmes can be used to ensure programmes themselves are self-delimiting. This then

• allows algorithms and subroutines to be concatenated (joined), with little increase in overall length;

• provides a connection between the length of an algorithm describing a string and the probability of that algorithm in the set of all algorithms – thus tying AIT more closely to information theory; and

• allows a universal distribution to be defined and by so doing provides insights into betting payoffs and inductive reasoning.

Algorithmic Information Theory and Entropy

The algorithmic entropy of a string is taken to be identical to the algorithmic complexity measure using self-delimiting coding. I.e. the algorithmic entropy of the string equals the length of the shortest programme, using self-delimiting coding, that generates the string. Once provision is made for the computational overheads (see sections 4.1 and 5.2), a calculation of the algorithmic entropy for a string representing an equilibrium state of a thermodynamic system is virtually identical to the Shannon entropy of the equilibrium state [96] and, allowing for different units, virtually equals the entropy of statistical thermodynamics [105]. However, conceptually, the algorithmic entropy is different from the traditional entropies. The algorithmic entropy is a measure of the exact state of a system and therefore has meaning outside of equilibrium and would appear to be more fundamental. Zurek [113] has shown how Algorithmic Information Theory can resolve the graining problem in statistical thermodynamics.


Zurek also shows how to specify the state of the N particles in a Boltzmann gas (i.e. a gas where the particles have no internal degrees of freedom) in a container. The position and momentum coordinates of all the particles can be specified in a 6N dimensional position and momentum phase space. The grain or cellular structure of the phase space is first defined algorithmically. Once this is done, the particular microstate of the gas, corresponding to an actual instantaneous configuration, can be described by a sequence of 0's and 1's depending on whether a cell in phase space is empty or occupied. The algorithmic entropy of such a microstate at equilibrium is virtually identical to the Shannon entropy of the system. Zurek has defined the concept of "Physical Entropy" to quantify a system where some information can be described algorithmically and the remaining uncertainty encapsulated in a Shannon entropy term. However, it is now clear that a new type of entropy, such as Physical Entropy, is unnecessary, as an algorithm that is able to generate a given string with structure, but also with variation of the structure or uncertainty, can be found, and its length will return the same entropy value as Zurek's Physical Entropy (see the next section and section 4). The limiting value of the physical entropy of Zurek, where all the patterned information is captured, can be identified as the true algorithmic entropy of a noisy or variable patterned string.

Problems with AIT

There are, as Crutchfield and his colleagues Feldman and Shalizi [95, 41] point out, a number of apparent difficulties with AIT. Some of these are to do with custom or notation. For example, the use of the word "complexity" by mathematicians to measure degree of randomness does not fit in with the intuitive approach of the natural scientist, where the most complex systems are neither simple nor random, but rather are those systems that show emergent or non-simple behaviour. Provided everyone is consistent (which they are not) there should be little problem. As is mentioned earlier, the approach here is to use the phrase "organised systems" to refer to what a natural scientist might call a "complex system", and to use "algorithmic complexity" or "algorithmic entropy" to mean the measure of randomness used by mathematicians. The remaining two key problems that these authors identify with the AIT approach are:

• The algorithmic entropy is not computable because of the Turing Halting problem; and

• The algorithmic description of a string must be exact; there is no apparent room for noise, as the description of the noise is likely to swamp an algorithmic description of a noisy patterned string.

However, the so called halting problem is a fact rather than a problem. Any measure of complexity that avoids the halting problem inevitably sacrifices exactness. Nevertheless, when pattern is discovered in a string by whatever means, it shows a shorter algorithm exists. The second bullet point above is more critical. The author found that AIT could not be applied to even the simplest natural system because of the assumed inability to deal with noise or variation. However, it has since become clear that AIT is able to describe noisy strings effectively (Devine [45] and Vereshchagin and Vitányi [108]). In essence, the algorithmic entropy of a string exhibiting a noisy pattern, or variation of a basic pattern, is defined by a two part algorithm. The first specifies the set of patterned strings and the second identifies a particular string within the set of outcomes with the same characteristic pattern. The second stage contribution turns out to have a value virtually identical to the Shannon entropy of the patterned set; i.e. the uncertainty inherent in the set of outcomes.
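
The two-part idea can be sketched in a few lines of Python (the model set and the string below are my own toy example, not the author's): part one specifies the set of strings sharing the pattern, and part two is the index of the particular string within that set, which costs about log2 of the set size - the Shannon-entropy-like contribution.

import math

def in_model_set(s):
    # part one, the model: every even position holds a '1'; odd positions are free
    return all(s[i] == "1" for i in range(0, len(s), 2))

n = 16
s = "1011101110111011"                                # a noisy patterned string in the set
assert in_model_set(s)

free_bits = "".join(s[i] for i in range(1, n, 2))     # the symbols not fixed by the model
index = int(free_bits, 2)                             # part two: the index within the set
print(index, math.log2(2 ** (n // 2)))                # identifying the string costs about n/2 bits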

As the algorithmic description of the pattern using this approach may not be the shortest possible, the algorithmic complexity or the algorithmic entropy is not defined in an absolute sense. Rather, it is a measure of the upper limit to the algorithmic entropy. Here this upper limit will be called the "provisional" entropy, with the understanding that the true entropy may be less (see section 4). Previously Devine [45] used the word "revealed" entropy. However, "provisional" may be better as it ties in with Zurek's "physical entropy" and the Algorithmic Minimum Sufficient Statistic measure of a particular string outlined in section 4.2. These different measures, and that of Gács [59, 60], which extends the idea to infinite strings, are all essentially the same. They provide an entropy measure of a noisy string based on the available information or observed pattern or structure.

An approach based on the Minimum Description Length (MDL) (see section 7) also provides a description of a string fitting noisy data. The ideal MDL approach models a system by finding the minimum description length; i.e. the shortest model that fits the data. This description contains a description of the hypothesis or model (equivalent to the description of the set) together with a code that identifies a particular data point given the model. Given the model, the particular string x_i is identified by a code based on the probability p_i of the occurrence of the string. If two vertical lines are used to denote the length of the string between the lines, the length of this code is |code(x_i)| = -log2(p_i). The expectation value of |code(x_i)| is the Shannon entropy of the set of noisy deviations from the model. However, while the MDL approach compresses the data, it does not actually fulfill the requirement of AIT that the algorithm must generate the string. The MDL approach only explains it. The full algorithmic description must take the MDL description and combine this with an O(1) decoding routine to actually generate a specific string.
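
As a quick numerical illustration of the code-length rule |code(x_i)| = -log2(p_i) (the probabilities below are a made-up model, not data from the text):

import math

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}        # a hypothetical model
code_lengths = {x: -math.log2(p) for x, p in probs.items()}
expected_length = sum(p * code_lengths[x] for x, p in probs.items())
print(code_lengths)        # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}
print(expected_length)     # 1.75 bits, the Shannon entropy of the model's outcomes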

All these approaches reinforce the point that the algorithmic entropy includes the length of the description of the hypothesis or model, and a further contribution that is virtually the same as the Shannon entropy measure of the uncertainty. The above arguments show that the algorithmic entropy of a noisy or variable string with a pattern can be effectively described by combining two algorithms. One specifies the structure or pattern characteristic of the set of all similar strings, while the other identifies the particular string within the set. In effect, this second contribution corresponds to the Shannon entropy or the uncertainty in the set of similar strings. Where an individual string shows no particular structure, the string is typical and the algorithmic description or provisional entropy returns a value the same as the Shannon entropy for the set of all such strings. All members of the set of similar patterned strings have the same provisional entropy. However, a few non-typical or non-representative strings may exhibit further pattern or structure. The algorithmic entropy for these strings will be less than that of a typical member of the set of similar strings.

1.3 Algorithmic Information Theory and mathematics

Most of the developments of AIT have taken place within the discipline of mathematics to address problems related to randomness, formal systems and issues that would appear to have little connection with the natural sciences. However, because these theorems are cast in mathematical terms, they are not readily available to the typical scientist.

A key outcome of Algorithmic Information Theory is its application to formal axiomatic systems. The assumptions and rules of inference of a formal system can be described by a finite string. The theorems of the formal system, specified by an appropriate string, can be recursively enumerated through a programme that implements the axioms and rules of inference. Chaitin [28] has shown that a theorem is compressible to the axioms, the rules of inference, and the set of instructions that prove the theorem. The algorithmic complexity of a theorem is the length of the algorithm that generates the theorem, thereby showing it is provable.

Chaitin [28] has also shown that some possible theorems or propositions cannot be compressed, i.e. cannot be enunciated in terms of a simpler set of axioms and rules. As these propositions are represented by a string that cannot be compressed, they are undecidable. They may be true or false. This is of course Gödel's theorem, but with the twist that unprovable theorems are everywhere. As there are only 2^n - 1 strings shorter than a string of length n, most strings cannot be compressed and therefore the associated theorems are undecidable. While these can be made decidable by including them as new axioms in a more complex formal system (in the sense of having a longer description), the number of random strings corresponding to unprovable statements grows faster than the provable ones. As Chaitin [30] put it: "if one has ten pounds of axioms and a twenty-pound theorem, then that theorem cannot be derived from those axioms". This leads Chaitin to speculate that mathematics should accept this uncertainty and become experimental [29, 33, 35]. Randomness, Gödel's theorem and Chaitin's view of mathematics are discussed later in section 3.7.

1.4 Real world systems

In the discussion about an algorithmic description of a natural system, a distinction needs to be made between a physical system which emerges due to an ongoing computation based on physical laws, and one that just describes the final state from the point of view of an external observer. Wolfram [112] gives examples of cellular automata that, operating under simple rules, produce highly patterned structures. The rules that generate the structure after a fixed number of steps are more likely to provide a shorter description than an algorithm that just describes the result. Another example is a flock of geese flying in V formation (see New Scientist [85]). This formation can also be described by an algorithm developed by an external observer. However, the pattern arises because each goose optimises its position with respect to other geese, both to minimise energy use, to maintain an adequate separation from other geese and to allow for sufficient vision to see ahead. The formation that emerges is a consequence of the computational process of each goose. The algorithmic description as an entropy measure will be the shorter of these two different computational descriptions. This will either be the computations undertaken by each goose as computing machines, or that observed externally. However, which of these is the minimal description may depend on what degrees of freedom the observer is interested in. It is likely that, where the entropy measure must include all degrees of freedom, the on-going computation, based on physical laws, that gives rise to the system is likely to be more compressed than one just derived from observation. Indeed, the universe itself is often assumed to be a computational machine undertaking an on-going computation that steps through all the valid states of the universe. However, if we are only interested in a small part of the universe, the algorithmic description can be compressed more succinctly than describing the whole universe.

Algorithmic Information Theory contributes to making sense of real world systems in a number of ways. While the main approach of this book is primarily about emergence, and linking bottom up with top down approaches to understanding systems, reference will be made to these other contributions. AIT has provided a set of tools to address some deep philosophical questions. These might be to do with the conservation of information and entropy, and issues to do with why the universe is as it is. For example, Paul Davies in The Fifth Miracle [43] taps into Algorithmic Information Theory. Some of the questions raised and approaches taken will be outlined below. The current book also expounds a view as to why scientific reductionism is such a successful sense-making approach (see section 10.3). In short, it is because much of the observable universe can be described algorithmically in terms of subroutines nested within subroutines; i.e. the universe appears to be algorithmically simple. The physical realm mapped by a subroutine can be studied at the subroutine level. For example, a stone falling under gravity can be described by a simple algorithm because interaction with the rest of the universe, including the observer, can usually be ignored. The subroutine that describes this behaviour has low algorithmic entropy and, as a consequence, it can be analysed by our own very limited cognitive capability. If the universe were not algorithmically simple, the reductionist process would bear little fruit, as the computing capability of the human brain, which itself is part of the universe, would be inadequate to make any sense of a physical system (see also Chaitin [34]). In general, it would be expected that our brains would be insufficiently algorithmically complex to decide on the truth or falsity of propositions belonging to a more algorithmically complex universe. Even allowing for the ability to extend our brain's capability by the linking of human brains, or utilising computational machines, we are no more likely to understand the universe than an ant is to understand us. Much of the universe appears simple and for that reason is amenable to a reductionist approach.

The length of the string that describes the total universe is mind-bogglingly long. Clearly, as Calude and Meyerstein [19] point out, sections of the string might appear ordered on a large scale - e.g. the scale of our existence - yet the string could be completely random over the full time and space scale of the universe that embeds this ordered section. In other words, the ordering of our universe may be a fluctuation in a region of the much larger random string that specifies a larger universe. While this may be so, it is argued in this work that, where large scale order exists, no matter how it arises, Algorithmic Information Theory can be used to show how the laws of the universe can generate new ordered structures from existing order. This is discussed in detail in section 9.

Nevertheless, there is a fundamental issue; the algorithmic measure of a description requires the algorithm to halt. For example, the following algorithm, which generates the integers one at a time, is very short because it does not halt.

N = 0
1. N = N + 1
PRINT N
GO TO 1.     (1.1)

However, the algorithm that specifies a particular integer must halt when that integer is reached. It is much longer. Consider the following algorithm that generates N1 which, for example, could be the 1,000,000th integer. It is of the form

N = 0
1. N = N + 1
IF N < N1 GO TO 1
PRINT N
HALT     (1.2)

About log2 N1 bits are required to specify N1. For example, if N1 = 1000000 the algorithm is about 20 bits longer than the programme that generates the integers in turn without stopping. A similar issue arises with the algorithm that describes the universe. The universe would appear to be more like a non-terminating Turing Machine (at least so far). The patterns we observe are for ever changing. However, just as the algorithm that steps through all integers requires a halt instruction to specify one particular integer, so the algorithm that specifies a particular configuration of the universe, or part of it, at an instant of time requires a halt instruction. The specification of the number of steps, either implicitly or explicitly, that forces the algorithm to halt at an instant of time is extremely large compared with the algorithm that just cycles through all the universe's states without stopping. As is discussed in section 6.3, the value of the integer specifying the halt state is the dominant contributor to the algorithmic entropy for a particular configuration of the universe that is distant from its initial state, but which has not reached equilibrium.


Chapter 2

Computation and Algorithmic Information Theory

2.1 The computational requirements for workable algorithms

Algorithmic Information Theory defines the order or structure of an object by the length of the algorithm that generates the string defining the structure. This length is known as the Kolmogorov complexity, the algorithmic complexity or the programme-size complexity of the string. However, there are two variations of the complexity measure. The first is where end markers are required for the coded instructions or subroutines to ensure the computer implementing the instructions knows when one instruction is completed before implementing the next. The Turing Machine (TM) that implements such an instruction set from its input tape will need to distinguish instructions by separating them with blanks or some other marker. As is shown later, the coding with instruction end markers gives rise to what is known as the plain algorithmic complexity. Alternatively, when no code or algorithm is allowed to be a prefix of another, the codes, routines and algorithms are self-delimiting. I.e. there can be no blanks on the input tape to the TM to indicate when an instruction finishes. Rather, by restricting instructions to a prefix-free set, the machine knows when an instruction is completed as these can be read instantaneously (Levin, 1974; Chaitin, 1975). Chaitin [32] points out that an operational programme, such as a binary algorithm implemented on a UTM, is self-delimiting as it runs from start to finish and then halts. Self-delimiting subprogrammes can operate within a main programme, behaving rather like parentheses in a mathematical expression. The length of the algorithmic description using self-delimiting coding will be termed the 'algorithmic entropy', as it turns out to be an entropy measure and, for physical situations, this is preferred over the plain algorithmic complexity.

However, irrespective of the coding used, the kind of algorithm and its length depends on the type of computer used. Fortunately, for two reasons, it turns out that the specifics of the computer are usually of no great significance as a measure of complexity. The first reason is that an algorithm written for a Universal Turing Machine (i.e. one that is sophisticated enough to simulate any other computing machine) can be related to any other computing machine, once the simulation instructions are known. The length of the simulation algorithm just adds a string of fixed length to the algorithm on the given universal machine. The second reason is that in science, only the entropy difference is the relevant physical measure because entropy is a function of state. Where algorithmic entropy differences are measured on different UTMs, any simulation constant cancels. Entropy differences are therefore machine independent. A further consequence is that common algorithmic instructions, such as those referring to physical laws, or describing common physical situations (e.g. the resolution of the phase space), also cancel. In which case, the algorithmic entropy becomes closely aligned to the Shannon entropy.

The following sections deal with the theory of the computing device known as a Turing Machine (TM). For those who want to skip the detail, the take home messages are:

• A Turing Machine (TM) is a standard procedure for generating any computable function (see 2.2). Where a set of instructions allows one to unambiguously compute a particular function from a given input, such instructions can be written for a TM or for any computing device that is equivalent to a TM. Each TM is equivalent to a function. For example, given an integer n there exists a set of instructions that can turn n as an input into the function n^2 as an output. I.e. the TM that implements this set of instructions corresponds to the function n^2.

• There are many different computational processes and devices that are equivalent to a TM.

• A TM may only halt on some inputs. If the machine does not halt, no output is generated. Because no TM can be devised to check whether a mathematical theorem is true or not, the failure to halt might mean the computation needs to run longer or that it might never halt. In which case the theorem cannot be proven to be true or false.

• A Universal Turing Machine (UTM) is one that can simulate any TM or any UTM. The typical computer, familiar to scientists, is a UTM (see section 2.2) that is programmed. Often it is desirable to use a simpler UTM and a longer programme where the programme instructions are transparent. The alternative is to use a sophisticated UTM where much of the computation is embodied in the hardware or the computer's instruction set. The classical universe, driven by deterministic physical laws, is a reversible UTM. This is why we can use computers and rational laws to map the behaviour of the universe.

Those who wish to can read the next sections; those who might want to come back to this later can jump to section 3.


2.2 The Turing machine

A Turing Machine (TM) is a simple computational device or, more accurately, a computational procedure that is able to compute any function that is intuitively computable. This is captured in the Church-Turing thesis - any effective computational procedure can be represented by a TM. Generally a TM is considered to be deterministic, as there is a well-defined output for each step, otherwise the machine halts. The TM may be described as follows (see Casti [21] for a worked example):

• It has an arbitrarily long tape (extending in one direction) divided into cells. Each cell contains a symbol from a finite alphabet. One of these symbols corresponds to a blank. The finite number of non-blank cells on the tape represent the input data.

• The TM has a read/write head that can read a symbol, move one cell left or right, and write a symbol. At the beginning of each computational step the read/write head is poised to read the current cell.

• The machine is in one of a finite set of states which determine the action of the machine at each step.

• Given the symbol read, and the state of the machine, a transition table shows what is to be written in the cell, what the new state of the machine is, and whether the head moves right or left. I.e. the transition from the old configuration to the new configuration can be written as a quin tuple of symbols.

(current state, current symbol read) → (new symbol written, move left/right, new state).

(Readers should be aware that some definitions express these moves as a quad tuple by restricting the machine's ability to move and write in the same step.) The transition table is in effect the programme that determines what calculation the machine undertakes. A physical machine is unnecessary, as one can undertake a Turing computation on an Excel spreadsheet, or by following instructions using a piece of paper. It is the instructions, not the structure, that embodies the machine.

The so called machine starts in an initial configuration and during the computation process steps through the different configurations, reading and writing on the tape at each step as determined by the transition table. If there is no effective instruction for the current configuration to implement, the machine halts and the symbols remaining on the tape become the computational output. A TM uses two or more symbols. Where a binary alphabet is used, symbols need to be coded in binary form. A simple, but inefficient, form of coding for the natural numbers is to represent the number n by a zero and n + 1 ones followed by another zero, i.e. 01^(n+1)0. The number zero can then be represented as 010 and 5 by 01111110. In a sense each TM is equivalent to just a single programme and, in modern terms, the tape serves as the memory of the machine. There are equivalent variations of the simple TM. The two tape machine outlined by Chaitin [25] is a more straightforward representation of the computational processes underpinning the measurement of algorithmic complexity. This machine has both a work tape and a programme tape. The work tape initially contains the input string y. The computation can be represented by T(p, y) = x, which implies that, given input y and programme p, the machine T halts to output x. The two tape machine is more efficient in a time sense and, because most of the instructions are on the programme tape, the machine is more like a conventional computer, as the programme controlling the output can be manipulated external to the computational system.
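
The transition-table picture above can be made concrete with a toy simulator. The following Python sketch (the machine, its states and the coding choices are illustrative assumptions, not taken from the text) applies rules of the form (state, symbol) → (new symbol, move, new state) until no rule applies, at which point the machine halts and the tape contents are the output.

def run_tm(transitions, tape, state="start", blank="0", max_steps=1000):
    tape, head = list(tape), 0
    for _ in range(max_steps):
        symbol = tape[head] if head < len(tape) else blank
        if (state, symbol) not in transitions:      # no applicable rule: the machine halts
            break
        new_symbol, move, state = transitions[(state, symbol)]
        if head >= len(tape):
            tape.append(blank)
        tape[head] = new_symbol
        head = head + 1 if move == "R" else max(head - 1, 0)
    return "".join(tape)

# A hypothetical machine that scans right over a block of 1s and appends one more 1.
transitions = {
    ("start", "1"): ("1", "R", "start"),
    ("start", "0"): ("1", "R", "done"),
}
print(run_tm(transitions, "111"))    # prints 1111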

The Universal Turing Machine

Each possible TM can be viewed either as a machine that implements a specific algorithm, or as an algorithmic process to enumerate a function. A Universal Turing Machine (UTM) is one that can simulate any other TM by including the simulation instructions as a programme input. As has been mentioned, a TM is defined by its set of quin tuples (or quad tuples). These map the before and after states of the transition table. A UTM is a machine that, when given the quin tuples that represent the transition table, can simulate the TM. It becomes possible to define the string of quin tuples that specifies the transition table for a particular TM by a unique number known as its Gödel number. The ith set of quin tuples is given the Gödel number i. As there is a one-to-one correspondence with the integers through the Gödel number, the set of all TMs is countable but infinite. Once the Gödel number i for the TM denoted by T_i is established, its transition table can be generated by a simple routine that generates all possible quin tuples by stepping through all strings in lexicographical order until the ith transition table is found, rejecting those that do not represent valid transitions. This is similar to a routine that steps through all the natural numbers selecting, say, the even ones. I.e., the UTM denoted by U can simulate the computation T_i(p) = x by prefixing a simulation programme to p. Noting that instructions in the following computation are implemented from the left, the computation that generates x on U is U(e 1^i 0 p) = T_i(p) = x. The routine e is relatively short. It counts the 1's until the 0 is reached. This gives i for the UTM, and the transition table for the ith TM can be found by stepping through all transition tables till the ith one is reached. The UTM can simulate the ith TM having generated its transition table. While there are an infinite number of possible Universal TMs, Marvin Minsky has described a 7-state 4-symbol UTM [84] and Stephen Wolfram has described a 2 state 3 colour TM that has recently been shown to be universal by Alex Smith [107].
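
The "step through all strings until the ith valid one" idea can be sketched as follows (the validity test here is a stand-in assumption; a real implementation would check that a string encodes a well-formed transition table).

from itertools import count, product

def ith_valid(i, is_valid):
    # enumerate all binary strings in shortlex order, counting those that pass the test
    seen = 0
    for length in count(1):
        for bits in product("01", repeat=length):
            candidate = "".join(bits)
            if is_valid(candidate):
                seen += 1
                if seen == i:
                    return candidate

# Hypothetical test standing in for "encodes a valid transition table".
print(ith_valid(5, lambda s: s.count("1") % 2 == 0))    # prints 011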


Alternative Turing Machines

There are a number of alternative computing procedures or processes that are equivalent to a TM or are extensions of Turing Machines. These are briefly outlined below for completeness.

1. Register Machines. The register machine is Turing equivalent and gets its name from its one or more "registers", which replace a TM's tapes and head. Each register holds a single positive integer. The registers are manipulated by simple instructions like increment and decrement. An instruction that, on a simple condition, jumps to another instruction provides programming flexibility.

2. Tag System. A tag system may also be viewed as an abstract machine that is equivalent to a TM. It is sometimes called a Post tag machine, as it was proposed by Emil Leon Post in 1943. It is a finite state machine with one tape of unbounded length. The tape symbols form a first in first out queue, such that at each step the machine reads the symbol at the head of the queue, deletes a fixed number of symbols and adds symbols. One can construct a tag machine to simulate any TM.

3. Lambda calculus, developed by Church and Kleene in the 1930s, is the smallest universal programming language. It is a formal system specifically designed to investigate function definition, function application, and recursion. While it is equivalent to a TM, its focus is on the transformation rules rather than the computational steps.

4. A reversible TM. A TM can be made reversible where information about past states is either stored, or its instructions themselves are reversible. As the laws of physics are reversible, they can only be simulated adequately by a reversible UTM. The physical implications of this are discussed in section 6.

5. Probabilistic TM. Generally a TM is deterministic, as there is a well defined output for each step, otherwise the machine halts. But a probabilistic TM can be envisaged where, at some points in the computation, the machine can randomly access a 0 or a 1. The output of such a machine can then be stochastic. However, as there is no valid random number generator, only pseudo probabilistic machines would seem to exist. Nevertheless, Casti and Calude [23] have proposed that a quantum number generator be used to provide a true random input. Their article in New Scientist suggests some intriguing applications of a truly probabilistic TM. However, at least at the present time, AIT restricts itself to deterministic computation.

It should also be noted that if quantum computers become possible (e.g. New Scientist [42]), the Strong Church-Turing Thesis would not apply and efficient quantum algorithms could perform tasks in polynomial time.


Noncomputable Functions

A computable number is one for which there exists a defined procedure that can output the number to an agreed level of precision; i.e. there exists a TM that can implement the procedure. For example, as π can be evaluated to say 100 decimal places by a defined procedure, it is a real number that is said to be Turing-computable. One envisages a TM as a function that maps natural numbers as inputs on to a natural number as an output. For example, if the inputs are two numbers, the output can be the sum of these. But there is only a countably infinite number of TMs. Countably infinite means one can assign a unique natural number to each TM to count each, even if their number is infinite. The unique natural number for the TM is called its Gödel number. But as the number of TMs is only countably infinite, and as the functions of the natural numbers are infinite and uncountable, there are insufficient TMs to compute every function. The set of all real numbers is also uncountable and therefore most reals are not computable. As a consequence, a TM cannot halt for all inputs. If a TM halts after a designated finite number of steps, the function is computable. However, if it does not halt in the designated number of steps, it is uncertain whether it will halt if allowed to continue running.

Turing showed [106] that, where a computation is formulated as a decision algorithm, the inability to know whether a computation will halt implies that there are some propositions that are undecidable. The proof of the halting problem, as it is known, relies on a self-referential algorithm analogous to the Epimenides paradox "this statement is false". As a consequence, the halting problem is equivalent to Gödel's incompleteness theorem, and leads to Rice's theorem, which shows that any nontrivial question about the behaviour or output of a TM is undecidable.

A partially recursive function, or a partially recursive computation, is one that is computable; i.e. the TM expressing the function halts on some inputs but not necessarily on all. A Turing computer can therefore be envisaged as a partially recursive function: a procedure that maps some members of an input set of natural numbers onto an output set of natural numbers. On the other hand, a recursive function (or more strictly a total recursive function) is defined for all arguments (inputs), and the TM that embodies the procedure halts on all inputs.

A recursively enumerable or computably enumerable set is one for which an algorithmic procedure exists that, when given a natural number or group of numbers, will halt if the number or group is a member of the set. Equivalently, a procedure exists to list every member of the set, so that it can eventually be established that a particular string is a member. The set is in effect the domain of a partially recursive function. However, as there is no certainty that the computation will halt, a partially recursive function is only semi-decidable: the result of any computation is either ‘yes’ or no answer at all. All finite sets are decidable and recursively enumerable, as are many countably infinite sets, such as the natural numbers. From the point of view of this book, the provisional entropy measure requires the set of all strings that satisfy a particular requirement to be recursively enumerable; i.e. a procedure exists to determine whether a string is a member of the set or not.
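A minimal sketch of semi-decidability (illustrative only; the enumerator and function names are hypothetical): membership is confirmed by waiting for the element to appear in an enumeration of the set, and for a general recursively enumerable set there is never a point at which ‘no’ can safely be returned.

    from itertools import count

    def enumerate_squares():
        # example enumerator: lists every member of the set of perfect squares
        for i in count(0):
            yield i * i

    def semi_decide(x, enumerator):
        """Halts with True if x is ever listed; runs forever if x is not a member."""
        for member in enumerator():
            if member == x:
                return True   # 'yes'
            # for a general r.e. set there is no safe point at which to answer 'no'

    print(semi_decide(49, enumerate_squares))   # True
    # semi_decide(50, enumerate_squares)        # would loop forever: only semi-decidable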

From a scientific perspective, a primitive recursive function needs to be distinguished from a recursive function. Primitive recursive functions form a subset of the recursive functions and correspond to functions that can be generated by a recursive process applied to a few simple functions. Most functions used in science are in fact primitive recursive, as they can be implemented by a computation involving a “Do” or a “FOR” loop that terminates after a specified number of cycles.
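For instance, a sketch of this kind of bounded computation (illustrative only):

    def factorial(n):
        # a primitive recursive function evaluated with a FOR loop whose number
        # of cycles is fixed in advance by n, so the computation always halts
        result = 1
        for i in range(1, n + 1):
            result *= i
        return result

    print(factorial(10))   # 3628800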

2.3 Measure Theory

Useful measures, such as probability, the counting measure (the number of elements in subsets), and the Lebesgue measure (the distance measure, or volume measure in a multidimensional space), all have similar properties. Measure theory has been developed to provide a framework to establish the properties of these measures on sets, such as the set of real numbers, or reals, ℜ, that may be infinite. The critical feature of measure theory is that it provides the tools to determine how a measure on separate subsets is related to the measure for a larger set containing the subsets. For example, in the case of the counting measure, one expects the total number of elements in separate subsets that have no elements in common to be the sum of the elements in each subset. A general measure of a property of a set, and its members, will be denoted by µ. A measure must have appropriate additive properties like the counting measure just mentioned. The measure for all possible events, and arrangements of events, must also be able to be defined. Key to providing a framework for measure theory is the understanding of the Borel sigma algebra or Borel field. This framework is just a series of definitions which provide the rules for manipulating members of a set (e.g. the set of reals), to allow properties, such as probability, to be defined consistently. The Borel algebra is what is called a sigma algebra that is restricted to a set Ω which is a metric space; i.e. a space, such as the collection of open sets on the reals, where there is a distance function. Keeping this in mind, the collection B of subsets of a metric space has the following properties:

• The empty set is in B

• If a subset B is in B then its complement is also in B

• For any countable collection of sets in B, their union must also be in B. I.e. if B1 and B2 are in B then B1 ∪ B2 is in B and, for countably infinite members, so is the union ∪i Bi.

The axioms also imply that the collection of sets is closed under countable intersection and that the collection also includes the full set Ω. For the reals, the Borel sigma algebra is the smallest sigma algebra family that contains all the intervals; but it does not include all the subsets of Ω. (The set of all subsets is known as the power set.) In case you want to know, this restriction avoids the Banach-Tarski paradox. With these definitions of the properties of the subsets of the metric space, the definition of a general measure µ, such as probability or number of members, on the Borel algebra is as follows.

• µ(B) ≥ 0 and for the empty set µ(∅) = 0

• The measure is additive.

• For any countable collection B1, B2, B3, . . . of disjoint sets (i.e. Bi ∩ Bj = ∅ for i ≠ j) with union B, µ(B) = Σi µ(Bi); noting that B can be a countably infinite union of sets.

This last bullet point says that if there are no elements in common between the (sub)sets, the measure is just the sum of the measures for each subset.

In the particular case where the measure µ over the whole sample space is µ(Ω) = 1, the measure is termed a probability and is usually denoted by P. The measure definition of P corresponds to the Kolmogorov definition of probability used in AIT. However, for infinitely many events, the Kolmogorov probability requires the following extra axiom (see [79] p. 18), namely:

• For a decreasing sequence of events B1 ⊃ B2 ⊃ · · · ⊃ Bn ⊃ · · · such that the intersection of all such events is empty, i.e. ∩n Bn = ∅, the probability in the limit is given by limn→∞ P(Bn) = 0.

The Kolmogorov approach to probability provides a fundamental definition of probability that is more useful than the traditional ‘frequentist’ approach. However it has some counter-intuitive features. When the approach is applied to the number line of reals [0,1], the probability of the set of irrationals is 1 and the probability of one particular irrational is zero, as there are uncountably many irrationals. As the rationals are countable and each individual rational also has probability zero, it follows that the probability of the set of all rationals is zero. In other words, the Kolmogorov probability of drawing a rational at random from the set of all numbers on the number line is not just virtually zero, it is zero.

Cylindrical sets

Not all numbers are rational. Often in science a measurement, for example one that specifies the position and momentum coordinates of an atom, can be treated as a rational number when it is specified to a certain level of precision. However it is sometimes necessary to develop the mathematical tools to handle the irrational numbers. These are numbers whose decimal or binary expansion is infinite, as they cannot be expressed as a simple fraction of the form a/b where a and b are integers and b is non-zero. The reals or real numbers include both the rational and irrational numbers. Normally one considers the set of reals to be represented by the set of infinite numbers lying on the number line between 0 and 1, as the properties of any real greater than 1 depend on its properties between 0 and 1. Similarly, using binary notation, there is an infinite set of reals starting with the sequence x, i.e. the numbers on the number line represented by 0.x . . . . The concept of a cylindrical set provides a convenient notation for dealing with such sequences. The cylindrical set allows sets of reals belonging to the same region of the number line to be specified by their common prefix, which is x in the above example. The cylindrical set is denoted by Γx and is the set associated with prefix x just after the decimal point. For example the binary string 0.1011 . . ., where the dots indicate the sequence continues, is the set of one-way infinite sequences starting with 1011 and is represented by Γ1011. The cylindrical set Γx represents the reals from 0.x0∞ to 0.x1∞ (i.e. 0.x to 0.x111 . . ., where the ∞ sign refers to a repeated 0 or 1). Just as in decimal notation 0.00009∞ = 0.0001, so in binary notation 0.00001∞ is equal to 0.0001. It can be seen that the infinite run of 1s in the sequence 0.00001∞ can be replaced by 2−|0000| (= 2−4). In general 0.x1∞ can be replaced by 0.x + 2−|x|. (Note here the vertical lines denote the number of digits enclosed.) This allows one to see that Γx represents the half-open interval [0.x, 0.x + 2−|x|). For those not familiar with the notation, the square bracket “[” indicates the start point is a member of the set while “)” indicates that the endpoint is not.
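A small sketch (the helper name is illustrative, not from the text) makes the correspondence between a prefix and its cylinder explicit:

    def cylinder_interval(x):
        """Return (start, end, measure) for the cylinder of binary prefix x,
        i.e. the half-open interval [0.x, 0.x + 2^-|x|) with Lebesgue measure 2^-|x|."""
        start = sum(int(bit) * 2 ** -(i + 1) for i, bit in enumerate(x))  # value of 0.x
        measure = 2 ** -len(x)
        return start, start + measure, measure

    print(cylinder_interval("1011"))   # (0.6875, 0.75, 0.0625): the cylinder of 1011 has measure 2^-4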

For a measure µ based on the set of reals, such as the probability measure, µ(Γx) = Σb µ(Γxb), or in simplified notation µ(x) = Σb µ(xb). This means that a measure, such as the probability of a cylindrical set starting with the prefix x, is the sum of the measure of all the strings starting with this prefix. For example the probability measure denoted by P for all strings starting with 10 is given by P(Γ10) = P(Γ100) + P(Γ101); this is just the sum of the probabilities of strings starting with 100 and 101. Useful measures include the following.

• The Lebesgue measure λ is a measure of distance between reals, and is invariant under translation. For the real line interval [a,b] this corresponds to our intuitive concept of distance and λ = (b − a). The Lebesgue measure for a three dimensional space, i.e. ℜ3, is formed from the Cartesian product of the one dimensional space and is a volume measure. The 6N dimensional equivalent is the measure used to quantify volumes in phase space in statistical entropy calculations. The Lebesgue measure λ(Γx) is a measure on the cylinder represented by [0.x, 0.x + 2−|x|). An important point that is used later is that the measure corresponds to 2−|x|, which is the distance between the ends of the cylinder. For example, the cylinder starting with the empty string ∅, denoted by Γ∅, has the Lebesgue measure λ(∅) = 1, as this cylinder embraces the whole number line from .∅0000000 . . . to .∅1111 . . ., i.e. .00000 . . . to 1.

• The Kolmogorov probability measure on the reals [0,1], with cylindrical sets as the subsets, has been mentioned earlier. It is a consistent measure.


• The counting measure is straightforward; it is the number of members in a set or subset E, and is denoted by |E|.

The semi-measure. A semi-measure does not have the simple additive properties of a true measure. It is a sort of defective probability measure as the definition is loosened. The semi-measure µ for the set of strings Ω has

µ(Ω) ≤ 1.

In the discrete case, this corresponds to Σi µ(si) ≤ 1. For a continuous semi-measure specified on the real number line between 0 and 1, the measure has different additive properties from a true measure. I.e.

µ(Γx) ≥ Σb µ(Γxb).

The semi-measure is particularly important in Algorithmic Information Theory as 2−H(x), where H(x) is the algorithmic entropy, is a semi-measure. As section 3.2 shows, Σx 2−H(x) ≤ 1 because of the Kraft-Chaitin inequality. This connection means the semi-measure is useful in dealing with the halting probability and in developing the idea of a universal semi-measure to deal with inference, as is outlined in section 3.8. The semi-measure is computable from below, i.e. one can approximate it from below. This is probably all a bit heavy going for the non-mathematician. The take-home message is that set theory, with the idea of cylindrical sets to represent all strings with the same initial sequence, allows a number of different measures to be defined in a way that can cope with infinite sets.


Chapter 3

AIT and algorithmic complexity

As was pointed out in section 1.2, Algorithmic Information Theory (AIT) was originally conceived by Ray Solomonoff [97]. The approach was formalised by Andrey Kolmogorov [72] and independently by Gregory Chaitin [24]. These two authors showed how the machine dependence of the algorithmic complexity measure could be mostly eliminated. Later the approach was extended to self-delimiting coding of computer instructions by Leonard Levin [77, 78] and also Gregory Chaitin [25]. Algorithmic information theory (AIT) is based on the idea that a highly ordered structure with recognisable pattern can be more simply described than one that has no recognisable features. For example the sequence representing π can be generated by a relatively short computer programme. This allows the AIT approach to provide a formal tool to identify the pattern or order of the structure represented by a string. In this approach the algorithmic complexity of a string “s” is defined as the length |p∗| of the shortest algorithm p∗ that is able to generate that string. Here | . . . | denotes the length in bits of the programme between the vertical lines. The length of the shortest algorithm or programme that generates a string is also known as the Kolmogorov complexity or the programme-size complexity and, furthermore, it is also a measure of the information contained in the string. A short programme means the string s is ordered, as it is simple to describe. This contrasts with a programme that is as long as the string itself, which clearly shows no order. The importance of AIT for the natural sciences is that a physical or biological system can always be converted to a string of digits that represents an instantaneous configuration of the system in an appropriate state space. As an example, consider how one might represent the instantaneous configuration of N players on a sports field. If jumping is to be allowed, three position and three velocity (or more strictly momentum) dimensions are required to specify the coordinates for each player. The specification of the instantaneous configuration of N players requires a binary string in the 6N dimensional state space of the system. For the particular case of position and momentum coordinates, the state space is known as the phase space.

If the configuration is ordered, as would be the case if all the players were running in line, the description would be simpler than the situation where all players were placed at random and running at different speeds and in different directions. An actual configuration in the state space of a collection of particles will specify the position of each particle, its momentum or the equivalent, and the electronic and other states. If the string that describes a particular structure shows order, features or pattern, a short algorithmic description can be found to generate the string.

In practice strings and algorithms are usually represented in binary form. A simple example might be that of a magnetic material such as the naturally occurring lodestone. The magnetic state of each iron atom can be specified by the direction of its magnetic moment or spin. When the spin is vertically aligned the direction can be specified by a 1, and by a 0 when the spin is pointing in the opposite direction. The instantaneous configuration of the magnetic system can be specified by a string of 0’s and 1’s representing the alignment of each spin. Above what is called the Curie transition temperature, all the spins will be randomly aligned and there will be no net magnetism. In which case, the configuration at an instant of time can be specified by a random sequence of 0’s and 1’s. However, where the spins become aligned below the Curie temperature, the resultant highly ordered configuration is represented by a sequence of 1’s and, as shown below, can be described by a short algorithm.

Consider the following two sequences, which might represent the configuration of a physical system. The first is a string of N repeated 1’s and the second is a string of length N with a random sequence of 0’s and 1’s. These could be considered as the result of tossing a coin N times where heads is denoted by a 1 and tails by a 0. In the first case, the coin has obviously been tampered with, while in the second case the coin would most likely be a fair coin. Alternatively, these two outcomes could be considered to represent the orientation of spins in a magnetic system as outlined in the previous paragraph. The algorithms corresponding to the two different cases are discussed in the following points.

1. The first case is where the outcome is ordered, and is represented by a string of N 1’s in a row, i.e. s = “11111 . . . ”. As this is highly ordered it can be described by a simple algorithm p′ of the form:

FOR I = 1 to N
PRINT “1”
NEXT I. (3.1)

In this case the algorithm only needs to code the number N and the character printed, together with a loop instruction that repeats the print command. In what follows, the notation | . . . | is used to denote the length of the binary string representing the characters or computational instructions between the vertical lines. In which case the length of the algorithm p′ is:

|p′| = |N| + |1| + |PRINT| + |loop instruction|. (3.2)

The integer N is usually specified by the length of its lexicographic representation. (This is discussed in more detail in section 3.3.) In the lexicographic representation, |N| = ⌊log2(N + 1)⌋. The floor function around the log term represents the greatest integer ≤ log2(N + 1). In what follows logN, without the subscript, will be used to denote this integer - the length of the lexicographic representation. If N is large, the algorithm will be dominated by the length of N, which is close to log2N bits. If p′ is a binary algorithm that generates s on a computer C, then |p∗|, the size of the shortest algorithmic description p∗ that generates s, must be ≤ |p′|.

2. In this case the string is a random sequence of 0’s and 1’s represented by N characters of the form “110010 . . . 1100”. Because the sequence is random it can only be generated by a binary algorithm that specifies each character individually. I.e.

p′ = PRINT “s”.

If p∗ is the shortest algorithm to do this, then its length is

|p∗| ≈ |p′| = |s| + |PRINT|.

As the length of this algorithm must include the length of the sequence and the length of the PRINT instruction, it must be somewhat greater than the sequence length. (A rough numerical comparison of the two cases is sketched below.)
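As a crude illustration of the contrast between the two cases (the bit counts below are stand-ins for the |N|, |PRINT| and loop-instruction terms in equation (3.2), and the fixed overhead is an assumed value):

    from math import floor, log2
    import random

    OVERHEAD = 16   # assumed fixed cost, in bits, of the PRINT and loop instructions

    def description_length_ordered(n):
        # the programme only needs N (about log2 N bits) plus the fixed overhead
        return floor(log2(n + 1)) + OVERHEAD

    def description_length_random(s):
        # the programme must quote the string itself plus the fixed overhead
        return len(s) + OVERHEAD

    n = 1_000_000
    s = "".join(random.choice("01") for _ in range(n))
    print(description_length_ordered(n))   # about 35 bits for "111...1" of length 10^6
    print(description_length_random(s))    # about 1,000,016 bits for a random string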

In the AIT framework, the most complex strings are those that show no pattern and cannot be described by a shorter algorithm. However, as is outlined below, to be consistent, standard procedures are required to code the algorithms that specify the structure, and to minimise the computer dependence of the algorithm. There are two types of codes that can be used for the computational instructions. The first is where end markers are required for each coded instruction or routine to tell the computer when one instruction finishes and the next starts. This coding gives rise to the plain algorithmic complexity denoted by C(s). Alternatively, when no code or algorithm is a prefix of another, the codes or algorithms can be read instantaneously, requiring no end markers. This requires the codes to be restricted to a set of instructions that are self-delimiting or come from a prefix-free set (Levin, 1974, Chaitin, 1975). The algorithmic complexity using this coding, and where the algorithms themselves are self-delimiting, will be denoted by H(s). Because H(s) is an entropy measure of the string s that represents a configuration in the natural world, it will be termed the “algorithmic entropy”. To sum up, any natural structure that shows order can be described by a short algorithm compared with a structure that shows no order and which can only be described by specifying each character.

3.1 Machine dependence and the Invariance Theorem

Nothing has been said about the computational device that implements the programme p∗. Fortunately, as is shown below, this machine dependence can be minimised. Let p be a programme that generates s on a UTM denoted by U, i.e. U(p) = s. The algorithmic complexity measure on machine U is given by the shortest programme p that does this. However, as mentioned above, there are two forms of the algorithmic complexity. The first is called the ‘plain algorithmic complexity’, where each instruction or code has an end marker to show that it is completed. This plain algorithmic complexity is denoted by CU(s). However there is another version of the algorithmic complexity where self-delimiting coding is used and no end markers are required. This is possible where no instruction or algorithm can be a prefix of any other, in which case there is no ambiguity about when one instruction finishes and another starts (section 3.2). This is denoted here by HU(s), and is called the algorithmic entropy rather than complexity as it aligns with the Shannon entropy of information theory. The following are the equations for the plain algorithmic complexity and the algorithmic entropy, the only difference being that the entropy version requires self-delimiting coding:

CU(s) = min{|p| : U(p) = s} (3.3)

HU(s) = min{|p| : U(p) = s}. (3.4)

When the UTM generates s with input string y, i.e. U(p, y) = s, the measure becomes

CU(s|y) = min{|p| : U(p, y) = s}, (3.5)

HU(s|y) = min{|p| : U(p, y) = s}. (3.6)

As any UTM can simulate the computations made on any other, if another UTM W implements a similar programme to generate s then

CU (s) ≤ CW (s) + c, (3.7)

HU (s) ≤ HW (s) + c, (3.8)

where the constant c allows for the length of the extra instructions needed for one machine to simulate the other. The constant c is of the order of 1 as it is independent of the string generated, i.e. c = O(1). In practice algorithmic complexity can be evaluated on a standard reference UTM. The O(1) term can often be ignored as in most physical situations only differences are important. The measure of algorithmic complexity, the shortest programme to generate string s, must therefore be

C(s) ≤ CU (s) +O(1) (3.9)

H(s) ≤ HU (s) +O(1) (3.10)

When string y is given as an input to compute s, i.e. U(p, y) = s, the invariance theorem states:

C(s|y) ≤ CU (s|y) +O(1) (3.11)


H(s|y) ≤ HU (s|y) +O(1) (3.12)

The theorem is intuitively reasonable as a UTM can be built on a few simple instructions and an interpreter can simulate one computer on another. The proof of the theorem can be made robust as outlined in section 2.2. That section shows that a UTM U can simulate the ith TM Ti. Let Ti(p∗) = s. Let machine U read a string e, the short programme that extracts i from the string 1i0 (i ones followed by a zero) and then steps through the transition tables of each TM to find the transition table for the ith one. With this information U can simulate Ti(p∗), having established the transition table for that machine, which is in effect the programme sim(Ti). I.e., implementing instructions from the left, U(e1i0p∗) = Ti(p∗) = s. Thus the complexity measure CTi(s) of string s using machine Ti is |p∗|, while on U the measure is |e| + i + 2 + |p∗|. Hence CU(s) = CTi(s) + sim(Ti), where sim(Ti) = |e| + i + 2. (The routine e is not usually included in the simulation string as it is usually envisaged as part of the UTM definition.) Note that, because i represents a natural number, sim(Ti) can be very large. However, it makes sense to use a simple UTM with a minimal instruction set to be the reference machine. This approach encapsulates most of the instructions in the programme, as the complicated programme instructions can be compiled from the minimal set of instructions. For this reason Chaitin uses LISP as a programme language. He has shown that for LISP programming the simulation or O(1) term is of the order of 300 bits. Li and Vitanyi (page 210), using combinatory logic based on lambda calculus, get

C(x) ≤ |x|+ 8 and C(∅) = 2.

An alternative two-part definition of C(s) is given by Li and Vitanyi (page 107) as

C(s) = mini,p {|Ti| + |p| : Ti(p) = s} + O(1) (3.13)

H(s) = mini,p {|Ti| + |p| : Ti(p) = s} + O(1) (3.14)

where the minimisation is over all TMs and programmes that compute s. This definition minimises the combination of the programme length and the description of the Turing computer that implements the programme. This formulation is useful if the string has structure, as the approach shows how to capture structure and variation in the structure. Just as there is an optimum curve to fit a set of noisy data, one could have a highly complex curve attempting to go through each data point, or a simple curve with the deviations from the curve specified; the optimum is where the deviations are random. Similarly, one can put most of the description in the TM, in which case the programme will be short, or put most of the description in the programme, having a simple TM. There clearly will be an optimum description. This is effectively what is called the Minimum Description Length, discussed in section 4.2; the optimum occurs when all the pattern is included in the machine description and all the random structure is in the programme p. Section 3.2 and following sections focus on the relationship between the algorithmic entropy H(s) and the traditional entropies.


Issues with AIT

Crutchfield and his co-workers (Shalizi [95], Feldman [41]) have noted the following difficulties with the AIT approach to measuring the complexity of strings.

• There is no way to know in advance whether the string s has a shorter description. Even if a shorter description s′ is found, there is no certainty that it is the shortest description. This is equivalent to the Turing halting problem.

• The AIT approach must specify a single string exactly. It appeared that irrelevant information from an observer’s point of view, including noise, must be exactly specified, making the description excessively long and possibly useless for most observed patterns.

• The computational resources (time and storage) to run an AIT algorithm can grow in an unbounded manner.

• The complexity is greatest for random sequences.

• The use of a UTM may restrict the ability to distinguish between systems that can be described by less powerful computational models, and the UTM lacks uniqueness and minimality.

• The AIT approach does not capture algebraic symmetries.

The first two bullet points were discussed in section 1.2 of the Introduction. The halting problem is a fundamental problem, and using some other measure of complexity only fudges the problem. The ability of AIT to include noise and variation in strings is no longer a problem. As is discussed further in section 4, the concept of provisional entropy, or the Algorithmic Minimum Sufficient Statistic, circumnavigates this perceived problem. The following bullet points discuss the remaining issues.

• The third bullet point depends on what measure of complexity is appropriate for a particular application. Certainly for many physical situations simple AIT provides insights. However Bennett [11] has developed the additional concept of logical depth to capture issues related to resource use, while several other authors have discussed the tradeoff between size and resources [15, 79, 109].

• The fourth bullet point is a matter of definition as the critics recognise.

• The fifth bullet point arises because Crutchfield [38] is able to show that stochastic processes can be explained by a hierarchy of machines. Simpler processes can be described by computing devices that are much simpler than a TM. That this is not a problem can be seen if one uses the definition of algorithmic complexity that optimises the sum of the computer device description and the programme, as in equation (3.13). Whenever a simple TM captures the pattern, the description of the simpler device is short. One always needs to be aware of the issues related to uniqueness and minimality. But for many physical situations, such as the measurement of the algorithmic entropy of a system, differences rather than absolute values are the critical measure and these concerns are of no significance.

• The author is not sure what to make of the last bullet point, unless the point has been misunderstood. Where algebraic symmetries and scaling symmetries are recognised, any algorithmic description recognising the symmetry will be shorter than one that does not.

Crutchfield and his colleagues [40, 38] developed the causal state approach to provide a systematic process to highlight structure in a stationary stochastic series. However the purpose of AIT, in contrast to the causal state approach, is to describe the pattern, not to discover it. Hence any pattern recognition process that is able to highlight hitherto unobserved structure can be incorporated into the AIT description. These include the causal state approach, the practical Minimum Description Length perspective (see section 7), and other systematic processes, whether based on symmetry or physical understanding. Once pattern is identified by whatever means, further compression is seen to be possible.

3.2 Self-delimiting Coding and the Kraft inequality

A machine implementing an algorithm must recognise where one instruction ends and another begins. This can be done simply by using an end marker to delimit the code or the instruction. However, as Levin [78], Gacs [56] and Chaitin [25] have shown, there are many advantages if the coding of each instruction or number is taken from a prefix-free set. In such a set, no code can be the prefix of any other code, and no end markers are necessary. In this case these codes can be called instantaneous codes, because they can be decoded as they are read, or self-delimiting codes, as the code indicates its own length. They are sometimes termed prefix codes, which would seem confusing as, strictly, they come from a prefix-free set. Computations in the natural world implementing physical laws are self-delimiting. As has been mentioned, the Kraft inequality holds for self-delimiting codes. Let the codeword length of the ith code from the n codes in the prefix-free set be |xi|; then for binary coding the Kraft inequality states:

Σi 2−|xi| ≤ 1. (3.15)

This can be generalised to other alphabets.

Figure 3.1: Selecting binary codes so that no selected code is a prefix of another

Figure 3.1 shows the restriction on allowable codes if no code is to be a prefix of any other. Whenever a codeword is assigned, such as the codes in bold type in the figure, to ensure no code is a prefix of another, the assigned code cannot have a direct connection with a prior code or a subsequent code in the code tree. Thus any branch in the binary tree that starts with an already assigned code must be blocked off, as the remainder of that branch cannot be used, as indicated by the dashed lines in the tree.

For example, if a code is assigned as 100, then 101 can be assigned to a code only if no more codes in this branch of the tree are needed. If more codes are needed, the potential codes starting with 101 must be extended to 1011 and/or 1010 which, in the diagram, are codes 0111 and 0110. In general more codes can only be assigned if an unused code is extended further rather than being assigned itself. At each branch, an allowable code of length |x| contributes 2−|x| to the Kraft sum. As a consequence, the sum of the assigned codes Σi 2−|xi| is the series 1/2 + 1/4 + 1/8 etc. and at most equals 1.

Whenever a code is assigned, e.g. 0111 and/or 0110 as in figure 3.1, the contribution to the Kraft sum is at most 2−|0111| + 2−|0110|. This is at most 2−3; the same contribution that the parent code ‘011’ would have made to the sum if it were assigned instead. As the contributions are halved at each split, inspection shows that the overall sum can never be greater than 1. Calude’s Theorem 2.8 [16] provides a neat proof of the Kraft inequality where the codes are not restricted to a binary alphabet.

The converse of the Kraft-McMillan inequality also holds: whenever a set of codeword lengths satisfies the above inequality, an instantaneous (prefix-free) code can be constructed whose codewords have exactly those lengths and therefore also satisfy the inequality.
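The prefix-free property and the Kraft sum are easy to check directly for a given set of codewords. The following sketch is illustrative only (the helper names are not from the text); the code set is the kind of assignment pictured in Figure 3.1.

    from fractions import Fraction

    def is_prefix_free(codes):
        # no codeword may be the prefix of a different codeword
        return not any(a != b and b.startswith(a) for a in codes for b in codes)

    def kraft_sum(codes):
        return sum(Fraction(1, 2 ** len(c)) for c in codes)

    codes = ["00", "01", "100", "101", "110", "1110", "1111"]
    print(is_prefix_free(codes))   # True
    print(kraft_sum(codes))        # 1: these codewords completely cover the code tree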

The approach can be extended to complete programmes, not just instructions. If algorithms do not have end markers, no halting algorithm (e.g. the string x representing an algorithm) can be a prefix of any other algorithm (e.g. string xy) as, reading from the left, the machine would halt before undertaking the computation y. Thus if there are no end markers, the algorithms must also come from a prefix-free set. As the number of possible algorithms is infinite, the Kraft inequality needs to be extended to an infinite set of prefix-free algorithms. Chaitin [25] has proved the extension, known as the Kraft-Chaitin inequality, to infinite recursively enumerable sets.

Chaitin argues that, as no valid algorithm can be a prefix of any other for self-delimiting coding, if an algorithm p of length |p| is valid, any other algorithm starting with p, i.e. of the form pxxxx . . ., is invalid. The restriction on prefixes for valid algorithms requires the region on the number line between 0.p0 and 0.p1 to be blocked out and unavailable for further algorithms. As, for each valid p, the region blocked out on the number line is 2−|p|, the region blocked out for all valid p can at most equal 1, i.e. the length of the number line between 0 and 1.

More formally, consider a computer T(p) = x where programme p is a natural number in binary form. If p is a self-delimiting programme, no other programme can start with p. If one considers all the real numbers in [0,1], the infinite set of strings starting with 0.p must be blocked out and cannot be used for another programme. Section 2.3 introduces the idea of cylindrical sets, where the cylinder Γp, representing the reals between 0.p0∞ and 0.p1∞, covers all these non-allowed programmes. As section 2.3 shows, the Lebesgue or distance measure λ(Γp) on this cylinder is 2−|p|. If all p come from a prefix-free set, none of the cylindrical sets can overlap (i.e. they are disjoint). The measure property means that the sum over all, possibly infinitely many, p satisfies Σp λ(Γp) = Σp 2−|p|. This sum can be no greater than 1, the value of the Lebesgue measure on all the reals [0,1]. I.e. the Kraft inequality holds and Σp 2−|p| ≤ 1. Equality occurs when the union of all the cylindrical sets completely covers the reals.

However two standalone programmes cannot be concatenated (joined), as the first implemented will halt before implementing the second. Nevertheless, Bennett [11] shows these halting programmes can be joined (see 3.4) by the insertion of an appropriate linking instruction. I.e. if U(p, w) = x and halts, and U(q, x) = y, then U(rpq, w) = y, where r is an instruction that takes the output of the first routine to be an input of the second. At most, the joining code adds O(1) to the length of the combined routine.

Instead of using the plain algorithmic complexity as outlined in section 3.1, it is more useful for physical situations to use a definition that restricts the algorithms to a prefix-free set. Some authors use K(x) [79] for the version using self-delimiting coding. However, because this form of coding gives rise to an entropy measure aligned to the Shannon entropy, in the following the term algorithmic entropy, denoted by H(x) or Halgo(x), will be used instead of K(x). The algorithmic entropy H(x) differs from C(x), the plain algorithmic complexity, because self-delimiting coding leads to a slightly longer algorithm.

Because the algorithmic entropy is the length of an algorithm in the Kraft sum, the algorithmic entropy must also satisfy the Kraft inequality; namely Σall x 2−H(x) ≤ 1. Nevertheless, the equations for H(x) are identical in form to those for C(x), as the following indicates:

HU(x|y) = min{|p| : U(p, y) = x},

where U only accepts strings from a prefix-free set. To reiterate, equations 3.10 and 3.12 are given below. I.e.

H(x) ≤ HU (x) +O(1), (3.16)

and also

H(x|y) ≤ HU(x|y) + O(1), (3.17)

where, in the latter case, the input y is fed into the computation process. In practice one evaluates HU(x) on a standard reference computer, sometimes called a Chaitin computer (see Calude [16]), knowing that the true algorithmic complexity will be within O(1) at worst. Provided that comparisons of algorithmic complexity are measured on the same reference computer, the O(1) term is of little significance. In the physical situation the entropy is the preferred concept as, firstly, it avoids the ambiguity over the meaning of the word complexity and, secondly, as is shown later (section 5.2), the algorithmic entropy is virtually the same as the Shannon entropy. Furthermore, concepts such as mutual information, conditional entropy etc. arise naturally.

3.3 Optimum coding and Shannon’s noiseless coding theorem

If one wishes to code a message efficiently, it makes sense to use the shortest codes for the message symbols that occur most frequently. Shannon’s noiseless coding theorem shows how this can be done. Consider a set of message symbols s1, s2, s3, . . ., sk that occur in the expected message with probabilities p1, p2, p3, . . ., pk. Let the self-delimiting or prefix-free binary code words for these be code(s1), code(s2), code(s3), . . ., code(sk). Shannon-Fano coding is an efficient coding methodology that satisfies the Kraft inequality with code words constrained by:

−log2pk ≤ |code(sk)| ≤ 1 − log2pk. (3.18)

This implies that 2−|code(sk)| ≤ pk ≤ 2 × 2−|code(sk)|. Shannon-Fano coding can be implemented for a set of symbols that are used to create a message by ordering the set of symbols from the most probable to the least probable. Divide the list into two halves so that the total probabilities in each half are as near equal as possible, to form two subsets. Assign a 0 as the first bit of the code for the members of one subset and a 1 as the first bit for the codes in the other subset. The process is repeated for each subset. I.e. each of the two subsets is divided into halves so that the total probabilities of the two halves are as equal as possible. The second bit in the code is found by assigning a 0 to one of the sub-subsets and a 1 to the other. The process is repeated till each symbol has a unique code. By splitting in this way, the process ensures that no code for a symbol is a prefix of any other and the code lengths are no more than −log2pk + 1.
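A minimal sketch of this splitting procedure (illustrative only; the function and symbol names are not from the text). It also compares the expected code length per symbol with the source entropy discussed next.

    from math import log2

    def shannon_fano(symbols):
        """symbols: list of (symbol, probability) sorted from most to least probable.
        Returns a dict mapping each symbol to a binary codeword."""
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total, running, best, split = sum(p for _, p in symbols), 0.0, float("inf"), 1
        for i in range(1, len(symbols)):          # choose the split that makes the two
            running += symbols[i - 1][1]          # halves as near equal as possible
            if abs(total - 2 * running) < best:
                best, split = abs(total - 2 * running), i
        codes = {s: "0" + c for s, c in shannon_fano(symbols[:split]).items()}
        codes.update({s: "1" + c for s, c in shannon_fano(symbols[split:]).items()})
        return codes

    probs = [("a", 0.4), ("b", 0.25), ("c", 0.2), ("d", 0.1), ("e", 0.05)]
    codes = shannon_fano(probs)                        # {'a': '0', 'b': '10', 'c': '110', ...}
    average = sum(p * len(codes[s]) for s, p in probs)
    entropy = -sum(p * log2(p) for _, p in probs)
    print(codes, round(entropy, 2), round(average, 2))  # about 2.04 and 2.1 bits per symbol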

An alternative, Huffman coding, combines probabilities to form even more efficient coding. An arithmetic code [86] is a more efficient entropy-based code and is in principle slightly better. Hence, given the probabilities of symbols in the message, the average length of coding can be made very close to the Shannon entropy of the source Hsh, i.e. the entropy based on the expected occurrence of symbols. The expected code length per symbol (i.e. Σk pk|code(sk)|) in a message coded in this way is given by

Hsh ≤ Σk pk|code(sk)| ≤ Hsh + 1, (3.19)

where Hsh = −Σk pk log2pk is the Shannon entropy of the set of source symbols. As a consequence, given the probability distribution of the symbols, the expected length of a message consisting of N symbols can be made close to NHsh. This equation is known as Shannon’s noiseless coding theorem. A code where the expected code length per symbol equals the entropy is optimal for the given probability distribution and cannot be bettered.

The following subsection shows how to optimally code the natural numbers when the probability distribution of the natural numbers in the set of message symbols is not known. Later, in section 3.8, the algorithmic equivalent of Shannon’s coding theorem will be discussed.

Self-delimiting coding of a natural number

As the algorithmic entropy requires the instructions to be self-delimiting, one needs to consider how to do this for a natural number. It is straightforward to assign self-delimiting codes to a finite set of instructions or numbers with a known probability distribution using Shannon’s noiseless coding theorem outlined in the previous section. The programme can uniquely read each code and then decode it using a table or a decoding function. However, it is often necessary to code a natural number as an input to an algorithm, for example to provide the information that defines the number of steps in a computational loop. There are two problems in doing this. The first is that the conventional binary representation of a natural number is ambiguous, as it is unclear whether the string ‘010’ is the same as the string ‘10’. The ambiguity of a binary representation can be avoided by coding the number using lexicographic (dictionary) ordering. E.g. the integers from 0 to 6 etc. are coded lexicographically as ∅, 0, 1, 00, 01, 10, 11, etc., where ∅ is the empty string. It can be seen that the lexicographic equivalent of a natural number n is the binary equivalent of n + 1 with the leading 1 removed. This is equivalent to |n| = ⌊log2(n + 1)⌋, where the floor function in the equation is just the integer below the term in the L-shaped brackets. Here this will be denoted by logn, i.e. |n| = logn, noting that this is, at worst, one bit less than log2n. The second problem is that, where there is no probability distribution for the numbers to be coded, Shannon’s theorem cannot be used and one must find another way of making the codes self-delimiting.

While lexicographic coding is not self-delimiting, an inefficient form of a self-delimiting instruction can be constructed by adding, in front of the lexicographic representation, a prefix string of 1s (one for each digit of the representation) followed by a zero, to indicate the code size. More formally, if n represents a natural number in lexicographic form, a possible self-delimiting universal coding of the natural number is 1|n|0n. The length of the code then becomes 2logn + 1. E.g. the number 5 is ‘10’ lexicographically, and in this universal self-delimiting form becomes 11010 (i.e. two 1’s to identify the length of ‘10’, which is the code for 5, followed by a zero and finally 10, the code for 5). The machine reading the code counts the number of 1’s until the 0 is reached. The number of 1’s (i.e. 2 in the example) tells the decoder the length of the code that is to be read following the 0.
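A small sketch of the two codings just described (the function names are illustrative, not from the text):

    def lexicographic(n):
        # lexicographic code of n: the binary form of n + 1 with the leading 1 removed,
        # so 0, 1, 2, 3, ... map to '', '0', '1', '00', ...
        return bin(n + 1)[3:]

    def self_delimiting(n):
        # the universal code 1^|n| 0 n: a run of 1s giving the length, a 0, then the code
        code = lexicographic(n)
        return "1" * len(code) + "0" + code

    def decode_self_delimiting(stream):
        # read one self-delimited number from the front of a bit string; return (n, rest)
        length = stream.index("0")                    # count the leading 1s
        code = stream[length + 1: 2 * length + 1]
        return int("1" + code, 2) - 1, stream[2 * length + 1:]

    print(self_delimiting(5))                      # '11010', as in the example above
    print(decode_self_delimiting("11010" + "0"))   # (5, '0'): the next code follows immediately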

While this code is universal, in that it optimises the average code length irrespective of the unknown probability distribution, it is a lengthy representation of the number and is not asymptotically optimal (Li and Vitanyi [79] p. 78-79). Calude [16], page 23, has a more compressed self-delimiting representation of n of length 1 + logn + 2loglogn bits. Alternatively, a more compressed code can be generated by an iterative procedure. If code1(n) = 1|n|0n, the first iteration is given by code2(n) = code1(|n|)n. This, as in the Calude case, is of length 1 + logn + 2loglogn bits. A further iteration, where code3(n) = code2(|n|)n, has length 1 + logn + loglogn + 2logloglogn. In principle there exists a code for n with length given by the infinite series

1 + logn + loglogn + logloglogn + . . . .

In practice the size of the representation of n will be ≤ 1 + logn + loglogn + 2logloglogn, or a higher expansion. The additional information required for self-delimiting coding can be interpreted as H(|n|), the length of the algorithm that specifies the size of n.

An important question is how significant the length information about the code for a natural number is in determining the algorithmic entropy. Are the log2log2 (hereafter denoted as loglog) terms and higher terms always needed? The answer is that for an equilibrium (or typical) state in a thermodynamic system, the length information is insignificant. For example, there are about N = 10^23 ≈ 2^76 particles in a mole of a Boltzmann gas. The algorithmic entropy of an equilibrium state requires the position and momentum of each of the 2^76 particles to be specified to a required degree of precision. Assume both the position and momentum coordinates of each particle are defined to 1 part in 2^8. This requires 16 bits per particle. If there are 2^76 particles, 16 × 2^76 = 2^80 bits are required. The algorithmic entropy of an equilibrium configuration is therefore approximately 2^80 bits. The loglog term, which is the first in the sequence capturing the length information, is log2(2^80) = 80. This is an insignificant contribution to the algorithmic entropy. The next term in the series is log2(80), which is smaller still. However, where a particular string is highly ordered, the algorithmic description will be much shorter and the contribution of the higher loglog terms may need to be included.

The discussion here is relevant to the evaluation of the provisional entropy of a string in a set of states, discussed in section 4.1. The provisional entropy consists of a two-part code: one part defines the set of strings exhibiting a particular structure or pattern, the other identifies the string within the set. In this case, it may be possible to embody the length information in the description of the set.

The relationship with entropy, conditional entropy etc.

An important question is when it is more efficient to compute two strings x and y together using a single algorithm rather than separately. In other words, when is

H(x, y) < H(x) +H(y)?

If x and y are independent, there is clearly no gain in calculating them together in a single algorithm, but if y has information about x it would suggest that computing y first will lead to a shorter description of x given y. In the case of the Shannon entropy, in general Hsh(X,Y) = Hsh(Y) + Hsh(X|Y). (Note that for the Shannon entropy, upper case labels are used to indicate the sets of random variables.) If the random variable X does not depend on Y, knowing Y does not decrease the uncertainty in X, i.e. Hsh(X|Y) = Hsh(X), in which case Hsh(X,Y) = Hsh(Y) + Hsh(X). A similar, but not identical, relationship occurs with the algorithmic entropy. If the strings x and y are to be described together, information about y may simplify the description of x. However, in contrast to the Shannon case, Chaitin [29] has shown that, unless the compressed description y∗ of y is used, algorithmic corrections are required. Let the reference computer, when given the null string ∅ and the minimal programme y∗, undertake the computation U(y∗, ∅) = y. If y∗ is used instead of y as an input to the computer that generates x with the minimal programme x∗, then, given y∗, U(x∗, y∗) = x. As both x and y can be computed knowing y∗,

H(x, y) = H(y) +H(x|y∗) +O(1)

Note that y∗ contains information about both y and the length of the shortest description of y. If y rather than y∗ is to be used, H(y), the length of the most compressed description of y, needs to be passed to the computation of x, i.e.,

H(x, y) = H(y) +H(x|y,H(y)) +O(1).

The above equation recognises that, in many cases, string x can be computed more efficiently once the minimal description of string y is known. Where this is the case H(x|y∗) ≤ H(x), indicating that x and y can be calculated more efficiently together than separately. I.e.

H(x, y) ≤ H(x) +H(y) +O(1).

As computing x independently may be achieved by a shorter programme than one that computes both x and y, H(x) ≤ H(x, y). In general,

H(x) ≤ H(y) +H(x|y∗) +O(1). (3.20)

When the full description of y∗ is essential to compute x, the equals sign is necessary. This occurs when y must be computed as part of the shortest algorithm to compute x. For example the equals sign is needed when y∗ represents the compressed version of the subroutines that express the physical laws needed to calculate x in a physical situation. In general, equation 3.20 shows programmes can be joined or concatenated in a subadditive manner to within O(1). Here, the O(1) term is the few extra bits required to join two separate algorithms, as was outlined in section 3.2 and is further discussed in section 3.4.

Shannon has defined the mutual information between two sets X and Y as I(X : Y) = Hsh(X) + Hsh(Y) − Hsh(X,Y). The mutual information is a measure of how many bits of uncertainty are removed if one of the variables is known. In the algorithmic case, the mutual information between the two strings is denoted by I(x : y) (or H(x : y)). Given the minimal programme y∗ for y, the mutual information I(x : y) can be defined as:

I(x : y) = H(x : y) ≡ H(x) − H(x|y∗)

or

I(x : y) ≡ H(x) + H(y) − H(x, y) + O(1).
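The algorithmic quantities above are not computable, but the flavour of the second definition can be illustrated with a standard compressor as a crude, purely indicative proxy for H (an assumption of this sketch, not a claim of the text):

    import zlib

    def c(s: bytes) -> int:
        # compressed length in bytes: a rough stand-in for an algorithmic description length
        return len(zlib.compress(s, 9))

    def mutual_info_proxy(a: bytes, b: bytes) -> int:
        # mimics I(x:y) = H(x) + H(y) - H(x,y)
        return c(a) + c(b) - c(a + b)

    x = b"the quick brown fox jumps over the lazy dog " * 40
    y = b"the quick brown fox jumps over the lazy cat " * 40
    z = bytes(range(256)) * 8    # unrelated data

    print(mutual_info_proxy(x, y))   # comparatively large: x and y share most of their structure
    print(mutual_info_proxy(x, z))   # much smaller: little information in common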

Readers should be aware that the definition of mutual information can differ slightly between authors. For example Chaitin, in references [29, 26], has defined a symmetric version of the mutual information by using the second equation but without the O(1) term, in which case an O(1) term is transposed to the first equation. However, in the more fundamental paper [25] Chaitin uses the definition above. Nevertheless, the difference in the definition is insignificant for measures on large strings. The remarkable similarity of the algorithmic entropy relationships with the Shannon entropy should not be overlooked. Neither should these be seen as being the same. Allowing for the use of y∗ instead of y and the O(1) term, they look identical. However the Shannon relationships show how the uncertainty in observing a random variable can be reduced by knowing about a dependent random variable. The algorithmic relationships show how computing one string simplifies the computation of a related string where there is common information. The algorithmic expression is about actual information shared between two strings. From the above algorithmic results other important results also follow.

H(x, y) = H(y, x) +O(1),

H(x) ≤ |x|+H(|x|) +O(1).

The equation immediately above shows that for a typical or close-to-random string, which is most strings, the algorithmic entropy is the length of the string plus the cost of requiring self-delimiting coding embodied in H(|x|). If the string can be compressed, the entropy will be less than the right hand side. Also,

H(x) = H(|x|) +H(x||x|) +O(1),

which for the case of a random string of length n gives

H(x) = H(n) +H(x|n) +O(1).

In algorithmic information theory, the information embodied in a string that specifies a state of a system or structure is the minimum number of bits needed to specify the string and hence the system. This makes intuitive sense and it follows that the information I(x) is the same as the entropy, i.e.,

I(x) = H(x).

Section 3.5 defines a halting probability Q. While Q is not a true probability, the algorithmic mutual information between two specific strings x and y can be defined in terms of Q. I.e.

I(x : y) = log2(Q(x, y)/(Q(x)Q(y))) + O(1).

The expectation value of the mutual information takes the same form as the Shannon definition of mutual information using probabilities, but with Q(x, y) instead of P(X,Y) etc.

3.4 Entropy relative to the common framework

Suppose y is essential as an input to generate x, and q∗ is the routine that halts on generating x given y, i.e. U(q∗, y) = x. If y can itself be generated by programme p∗ with the computation U(p∗, w) = y, can p∗q∗, given w, be used as a programme to generate x directly without y as an input? The answer is no, as the string p∗q∗ is not a valid algorithm. The computation stops when y is generated, as p∗ is self-delimiting; i.e. U(p∗(q∗, w)) halts once y is produced and does not continue to generate x. As has been shown [11, 25], for the machine U a prefix r exists which has the effect of stacking an instruction to restart the computation when it would otherwise end. Hence U(p∗((q∗, w)r)) = x. Here brackets are used within brackets to indicate the order of the computations. The algorithmic entropy is then the combined entropy given by |p∗| + |q∗| + |r|. It follows, for this particular case where y is necessary to generate x, that H(x) = H(x, y). The algorithmic entropy of string x can then be expressed as H(x) = H(y) + H(x|y,H(y)) + O(1), as y must be generated on the way to generating x. Nevertheless, this simple additive expression only covers the case of optimally compressed essential routines. In the case where string y is not an essential input to x, H(x) ≤ H(x, y) and H(x) ≤ H(y) + H(x|y,H(y)) + O(1).

However, as is discussed further in section 5.2, the argument is not needed for real-world, reversible, non-halting computations. Although self-delimiting algorithms know when to halt, a real-world computation is reversible and continues indefinitely. It is when an on-going computation puts a string representing a real-world configuration, such as y, in a region of computational space accessible to an algorithm q∗, that x can be produced. The bit flows (section 5.2) show that the algorithmic entropy of x is identical to the length of the concatenated programmes q∗ and p∗.

As the entropy difference rather than the absolute entropy has physical significance, in any physical situation common information strings, in effect the common instructions, can be taken as given. Effectively this information plays the role of the critical routine p∗ in the previous paragraph. Common information can include the binary specification of common instructions equivalent to “PRINT” and “FOR/NEXT” etc., and the O(1) uncertainty implied by the reference UTM. In a physical situation, the common information can include the physical laws and, for thermodynamic systems, the algorithmic description of the phase space grain size [113]. In what follows the common information routines will be represented by ‘CI’ and, given the common information, the physically significant entropy will be denoted by the conditional entropy Halgo(x|CI). As the entropy difference between two configurations x and x′ is Halgo(x|CI) − Halgo(x′|CI), the entropy zero can be taken to be H(CI) and the common contribution ignored.

An ordered sequence is recognized because there is an implicit reference to a patterned set that will be recursively enumerable. Any sequence x of this set can be at least partially compressed (see section 4), and the primary entropy contribution will depend on parameters such as the length of the string and the number of possible states in the system. The conditional algorithmic entropy of a member of this patterned set will be termed the 'provisional' entropy. The provisional entropy is the entropy of a typical member of the patterned set; it is an upper measure of the conditional algorithmic entropy of a member. However a very few non-typical members of the set may be further compressed. Whenever a member of a set can be compressed further, a more refined model is needed to capture the pattern in those particular strings. This is discussed further in the section 4.2 on the Algorithmic Minimum Sufficient Statistic.


3.5 Entropy and probability

Let the algorithmic halting probability QU(x) be the probability that a randomly generated programme string will halt on the reference UTM, denoted by U, with x as its output [25]. The randomly generated programme that has a halting probability QU(x) can be considered as a sequence of 0's and 1's generated by the toss of a coin. In which case, QU(x) is defined by

QU(x) = ∑_{U(p)=x} 2^{-|p|}.                    (3.21)

The sum of all such probabilities is ΩU = ∑_x QU(x). As each algorithm that generates x comes from a prefix-free set, ΩU ≤ 1 because of the Kraft-Chaitin inequality. However, as ΩU ≠ 1, QU(x) cannot be a true probability but is instead a semi-measure. Furthermore, there is no gain in normalising QU(x) by dividing by ΩU, as ΩU is not computable. Even so, this concept allows a very important connection to be made between the algorithmic entropy of a string and the halting probability. The argument is as follows.

• QU(x) ≥ 2^{-HU(x)}, as HU(x) is the minimum |p| that generates x, while the sum defining QU(x) includes not just one but all programmes that generate x. Basically, the shortest programme generating x contributes 2^{-HU(x)} to the sum for QU(x) and is the dominant contribution. This leads to the inequality,

HU (x) ≥ −log2QU (x) (3.22)

• But H(x) cannot be much greater than −log2QU(x), as there are only a few short programmes able to generate x; all the longer descriptions contribute little to the sum. As is shown below,

HU (x) ≤ ⌈−log2Q(x)⌉+ c. (3.23)

Because of equation 3.22, it must follow that

HU (x) = −log2Q(x) + c, (3.24)

where c is a small constant.

The upper limit on HU(x) (equation (3.23)) can be seen by noting (Downey, private communication) that, as HU(x) is an integer, HU(x) ≥ ⌈−log2QU(x)⌉. But ⌈−log2QU(x)⌉ can also be used in an algorithm to generate string x. More specifically, the length of a routine p that takes ⌈−log2QU(x)⌉ and decodes it to give the string x provides an upper measure of the algorithmic complexity of x. Basically, the routine p takes the integer value of −log2Q(x) and, starting with the lowest programme of this length, steps through all allowable routines in lexicographic order to see which, if any, generate the string x. If no routine generates x, the programme increments the length by one, and again steps through all allowable routines of this length. (This is a thought experiment, as some routines may not halt. However, we are only interested in finding those that do.) Eventually one, with length close to −log2Q(x), will generate x. The length of this routine cannot be less than HU(x) by definition. I.e. the programme p with length |p| = ⌈−log2Q(x)⌉ + |stepping routine| is an upper measure of the algorithmic entropy or complexity. From the second bullet point above, it follows that HU(x) ≤ |p|, or

HU (x) ≤ ⌈−log2Q(x)⌉+ c (3.25)

where c represents the length of the decoding routine. This upper limit can be combined with equation (3.22), leading to the relationship

−log2Q(x) ≤ HU (x) ≤ −log2Q(x) + c.

Equation (3.24) follows by including the fraction that rounds up −log2Q(x) in the constant c. Note the O(1) constant c is not dependent on x. Also, as it will only vary slightly for different UTMs, one can drop the suffix U representing the specific UTM. It can be seen that the constant c indicates that there are about 2^c programmes with lengths close to H(x). If there are 2^c programmes of a size close to H(x), and one ignores the significantly longer programmes, it follows that Q(x) ≤ 2^c 2^{-H(x)}. This implies, as stated above, that HU(x) ≤ −log2QU(x) + c. In effect, the constant 2^c becomes a measure of the small number of short programmes that make a significant contribution to Q(x).
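The relationship can be illustrated with a toy, purely illustrative calculation. The sketch below assumes a hypothetical prefix-free machine, specified simply as a table mapping programmes to outputs (it is not a real UTM); it computes Q(x) by summing 2^{-|p|} over the programmes producing x, takes H(x) as the length of the shortest such programme, and shows that −log2Q(x) sits within a small constant of H(x).

    # Toy illustration (not a real UTM): a hypothetical prefix-free "machine"
    # given as a table from programme strings to outputs.  Assumed data only.
    from math import log2

    machine = {
        "0":    "x",   # shortest programme for "x"
        "100":  "x",   # a longer programme with the same output
        "101":  "y",
        "1100": "y",
        "1101": "z",
    }

    def Q(x):
        """Halting probability of output x: sum of 2^-|p| over programmes giving x."""
        return sum(2 ** -len(p) for p, out in machine.items() if out == x)

    def H(x):
        """Algorithmic entropy proxy: length of the shortest programme giving x."""
        return min(len(p) for p, out in machine.items() if out == x)

    for x in ("x", "y", "z"):
        print(x, "H =", H(x), " -log2 Q =", round(-log2(Q(x)), 3))
    # For each output, -log2 Q(x) lies at or just below H(x),
    # i.e. -log2 Q(x) <= H(x) <= -log2 Q(x) + c for a small c.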

Q(x) is an unnormalised measure of the probability that a string generated at random will produce x on the UTM. However it can be envisaged as an actual probability by assigning the unused probability space to a dummy outcome. This allows Q(x) to be treated as a probability measure for a particular message x in an ensemble of messages. Relative to this probability, the expectation value 〈H(x)〉 over an ensemble of messages or strings is then close to 〈H(x)〉 = −∑_x Q(x)log2Q(x). This indicates that equation 3.25 in the previous paragraph is the algorithmic equivalent of Shannon's coding theorem. The algorithmic code for a string, whose length is its algorithmic entropy, is just a constant away from −log2 of the halting probability of that string. This again reinforces what has been shown earlier: while the algorithmic entropy and the Shannon entropy are conceptually different, the expectation value of the algorithmic entropy for a set of states is closely aligned to the Shannon entropy.

A more general approach to prove the same relationship, one that relies on the properties of a universal semi-measure, will be outlined in section 3.8.

3.6 The fountain of all knowledge: Chaitin’s Omega

Chaitin [25] asked the question: "What is the probability that a randomly selected programme from a prefix-free programme set, when run on a specific UTM, will halt with a given output x?" Section 3.5 defined QU as this probability. Strictly it is a defective probability and should be termed a semi-measure (see section 3.5). The halting probability ΩU is obtained by summing QU over all halting outcomes x generated by a random input.


ΩU = ∑_{U(p) halts} 2^{-|p|} = ∑_x QU(x).

ΩU is between 0 and 1. As Ω contains all possible halting algorithms, if the universe operates under computable laws, Ω contains everything that can be known about the universe; it embodies all possible knowledge. While the value of Ω is computer dependent, it can be converted between UTMs. Unfortunately, the contributions to ΩU can only be established by running all possible programmes to see if they halt. Every k bit programme p that halts (i.e. where |p| = k) contributes 2^{-k} to ΩU. Ω is random in the Chaitin or Martin-Lof sense (section 8.2). If Ω could be generated by a simpler programme, it would then be possible to predict whether a particular programme contributing to Ω will halt. This cannot be possible, as doing so would violate the Turing halting theorem. Chaitin (who named this number Ω) showed that, just as there is no computable function that can determine in advance whether a computer will halt, there is also no computable function able to determine the digits of Ω. Ω is uncomputable and can only be approximated from below by checking each possible programme, from the smallest upwards, to see which might halt in a given time. Interestingly, Calude, Dinneen, and Shu [18] have calculated Ω to 64 places and get the binary value Ω = 0.0000001000000100000110001000011010001111110010111011101000010000...
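The way Ω can only be approached from below can be sketched with another toy calculation. The fragment below assumes a hypothetical prefix-free programme table in which each programme is tagged with whether it halts and after how many steps (information that is of course not available for a real UTM); raising the step budget then gives a non-decreasing sequence of lower bounds on Ω.

    # Toy, assumed data: (halts?, steps to halt) for each prefix-free programme.
    # For a real UTM this information is uncomputable, which is the whole point.
    programmes = {
        "0":    (True, 3),
        "100":  (True, 12),
        "101":  (True, 7),
        "1100": (False, None),   # never halts
        "1101": (True, 50),
    }

    def omega_lower_bound(step_budget):
        """Sum 2^-|p| over programmes observed to halt within the step budget."""
        return sum(2 ** -len(p)
                   for p, (halts, steps) in programmes.items()
                   if halts and steps <= step_budget)

    for budget in (5, 10, 20, 100):
        print(budget, omega_lower_bound(budget))
    # The bounds increase monotonically towards the toy machine's Omega (0.8125 here),
    # but for a real machine one can never know how close the current bound is.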

3.7 Godel’s theorem and Formal Axiomatic Systems

Chaitin [28] has developed the algorithmic equivalent of Godel's theorem. He has shown that some possible theorems or propositions cannot be derived from a simpler set of axioms and rules. The rules of inference define computational processes. Where a proposition, represented as a string, can be expressed as a computation on the set of axioms, also represented as strings, the proposition can be expressed in a compressed form and so represents a decidable theorem. Those propositions that cannot be compressed in terms of the axioms are undecidable. They may be true or false. This is of course Godel's theorem, but with the twist that unprovable theorems are everywhere. As there are at most only 2^n − 1 strings shorter than a string of length n, most propositions represented by strings of length n cannot be compressed into a shorter string. They cannot be derived from simpler axioms and rules. Therefore the potential theorem associated with most strings is undecidable.

In order to explore Chaitin's approach using AIT, consider a formal logical axiomatic system that is as rich as arithmetic with multiplication, and which is required to be both consistent and sound. No statement in such a system can be both true and false, and only true statements can be proven to be true. Let 'A' be a string of symbols that represents the axioms of the system. The length of the shortest algorithm that describes the axioms is denoted by |A|. The set of all provable theorems is recursively enumerable. This means one can generate all theorems by a process that steps through all possible theorem proofs in ascending order of length, and uses a proof checking algorithm to determine whether a particular statement or proposition generated is a valid theorem. In principle, each provable theorem can be generated from the axioms plus the rules of inference. Those strings that might represent theorems, but which have no shorter description, do not belong to the recursively enumerable set of true theorems. As they cannot be expressed in terms of the axioms and rules of inference, they cannot be shown to be true or false. They are undecidable. There are far more statements that belong to the set of possible theorems than valid theorems. Worse, as the length of the possible theorem strings increases, the set of strings associated with unprovable statements grows faster than the provable ones. One can never be sure that a potential theorem is decidable; only when a proof (or the proof of its converse) is found will this be known. I.e. it is only when a potential theorem can be expressed as a compressed description in terms of simpler theorems, axioms and rules of inference that the theorem can be considered proven. This is of course Godel's theorem, but with an upsetting twist that unprovable theorems are everywhere; the provable theorems are only a small selection of the space of all possible theorems. An alternative way of looking at this is to tie the argument into the halting problem. The undecidability of the halting problem underpins the undecidability of a rich formal system.

The above argument can be made more robust. Denote by n the length of a string x where the statement "H(x) ≥ n" is a statement to be proved in the formal system. This statement asserts that there exists a string x whose algorithmic complexity is at least as great as its length. Such a string would be deemed random. The computational algorithm that generates the random string needs to specify each digit plus some overheads depending on the computer used. It has already been argued that many such strings must exist, but the question is whether a string with algorithmic complexity at least equal to n can be found and proved to be genuine. I.e. given n, is it possible to find a particular x that is random, in the sense of having no shorter description than its length? There are two ways to look at this. The first is to develop a procedure to step through all the valid proofs that can be derived from the axioms and the laws of inference until an algorithmic proof p finds the first x such that H(x) ≥ n. As has been mentioned, this is akin to stepping through all the integers until one is found with a particular property. The proof p will contain the axioms, embody the algorithm stepping through and checking the proofs, as well as the specification of n to indicate when the string of interest has been found. As the size of n can be represented by a little more than log2n bits in the algorithm, the length of this algorithm is:

|p| = |A|+ log2n+ |stepping algorithm and proof checker|+O(1). (3.26)

This implies that a computational procedure T exists that finds p such that T(p) = x. But |p| cannot be less than the shortest possible description defined by H(x), otherwise p could be used to provide a more compressed definition of x. If c is taken to be the length of the description of the system and the stepping algorithm, and because |p| cannot be less than the shortest description of x, it follows that

H(x) ≤ |p| = log2n+ c.

Here the finite constant c has absorbed the length of the fixed strings in equation 3.26. This implies that a proof has been found that outputs the first x where H(x) ≥ n. But as log2n + c ≥ H(x) and H(x) ≥ n, it follows that log2n + c ≥ n. This becomes impossible, as for any given finite c there will always exist an n large enough that n − log2n > c. For large n such a search for the proof can never halt, as such a proof cannot be found.
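The counting point is easy to check numerically. The short fragment below simply finds, for an illustrative constant c, the first n at which n − log2n exceeds c; the value 1000 chosen for c is arbitrary and stands in for the fixed overhead of the axioms, the stepping routine and the proof checker.

    from math import log2

    c = 1000  # illustrative overhead: axioms + stepping algorithm + proof checker
    n = 1
    while n - log2(n) <= c:
        n += 1
    print(n)  # first n for which log2(n) + c < n, so no proof of length log2(n) + c can establish "H(x) >= n"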

A slightly different argument, more in line with the original Godel approach, codes a meta statement about the formal system within the system. Let us try to prove the meta statement that there exists an n for which the statement "H(x) ≥ n is not provable". This statement should either be true or false in the formal system. Assume the statement is false, so that for any n a contrary example can be found: a value of x for which H(x) ≥ n can be proved. Just as in the earlier argument, if the system is sound there can be no contradiction, and for n sufficiently large, where n − log2n > c, no such proof can be found. Again

log2n+ c ≥ n.

We cannot prove the statement is false; therefore the original statement must be undecidable.

While these rogue statements can be made decidable by including them as new axioms in a more complex formal system (in the sense of having a longer description), even then the number of random strings corresponding to unprovable statements grows faster than the provable ones. As Chaitin [30] put it: "if one has ten pounds of axioms and a twenty-pound theorem, then that theorem cannot be derived from those axioms". This leads Chaitin to speculate that mathematics should accept this uncertainty and become experimental [29, 33, 35].

This can tell us something about Chaitin's somewhat mystical number Ω. Ω can be constructed from the provable theorems in a logical system; there is no contribution from undecidable theorems, as the generating computation will not halt for these. If Ω represents the logical system that encapsulates all the physical laws, knowing Ω is equivalent to knowing everything. If Ω were computable, all the physical laws could be expressed simply. Calude, Dinneen, and Shu [18] have calculated Ω to 64 places and get the binary value given in section 3.6.

While there are mathematical propositions in any formal system that are not decidable, it is impossible to determine which these are, even though the number of such undecidable propositions grows faster than the decidable ones as the description length of the propositions increases. If it is assumed that a possibly undecidable proposition is true, the formal system can be extended. However there is always the danger that the proposition was actually decidable and the assumed axiom could be inconsistent with the formal system. This has led Chaitin to suggest [33, 30] that mathematics should become experimental; useful mathematical hypotheses could be taken to be true unless later inconsistencies emerge. For example the Riemann hypothesis may be undecidable (although, for an update, see the article by Kvaalen [73] in New Scientist that suggests it might be provable), yet experimentally it seems likely to be true. This hypothesis could then be taken as an extra axiom, with the possibility that more new and interesting theorems could then be proved using the new axiom.

There is a further implication that many have wrestled with. Does the human mind operate as a finite TM when undertaking logical thought? If so, it will be constrained by Godel's theorem. Barrow, chapter 8 [6], scopes the views of some of the key players in the debate, while more recently Feferman [52] discusses the historical development. The particular issues surrounding strong determinism and computability have been explored by Calude et al [17]. It seems to the present author that, while the human brain or mind need not act as a finite TM to hit upon new truths about the universe, formal logical processes would appear to be needed to prove that these are truths. In which case, the proof process, rather than the discovery, will be constrained by Godel's theorem. While the human mind can add new theorems and use external devices to extend its computational capability, once the axioms and laws of inference become too great, the mind may not have sufficient capability in terms of its complexity to be assured of the consistency of the formal processes. Thus there may well be laws that are beyond human comprehension, in that these laws will not be able to be deduced from simpler principles or be understood in simpler terms. Clearly not all agree.

3.8 The algorithmic coding theorem, universal semi-measure, priors and inference

It was shown in equation (3.18) that for an enumerable probability distribution P, where the source alphabet of symbols has probabilities p1, p2, ... pk etc., a set of optimal codes denoted by code(sj) can be found for each symbol sj. The expected code length of a message made up of these symbols gives Shannon's noiseless coding theorem

Hsh ≤ ∑_k pk|code(sk)| ≤ Hsh + 1.

I.e. the expected length of the coded message is within one bit of the Shannon entropy of the message. As section 3.5 showed, Algorithmic Information Theory has a coding theorem, called the algorithmic coding theorem, which is the equivalent of Shannon's noiseless coding theorem. This section looks at an alternative approach to the algorithmic coding theorem based on establishing a universal semi-measure that is optimal. Just as one can have a Universal TM that, within fixed boundaries, can simulate any other TM, one can have a universal code that is asymptotically optimal in coding a message independently of the distribution of the source words. As is outlined below, and implied in section 3.5, this code for x has length H(x) or −log2Q(x); the two are equivalent to within an O(1) constant.
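Shannon's bound itself is easy to check for any concrete distribution. The sketch below, written against an assumed toy alphabet and illustrative probabilities, builds a Huffman code (one standard way of reaching the optimum) and verifies that the expected code length falls between Hsh and Hsh + 1.

    # Verify the noiseless coding bound H_sh <= sum p_k |code(s_k)| <= H_sh + 1
    # for a hypothetical source alphabet; the probabilities are illustrative only.
    import heapq
    from math import log2

    probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}

    def huffman_lengths(p):
        """Return code lengths for an optimal prefix-free (Huffman) code."""
        heap = [(prob, i, {sym: 0}) for i, (sym, prob) in enumerate(p.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, d1 = heapq.heappop(heap)
            p2, _, d2 = heapq.heappop(heap)
            merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (p1 + p2, counter, merged))
            counter += 1
        return heap[0][2]

    lengths = huffman_lengths(probs)
    H_sh = -sum(p * log2(p) for p in probs.values())
    expected_len = sum(probs[s] * lengths[s] for s in probs)
    print(round(H_sh, 3), round(expected_len, 3))  # H_sh <= expected length < H_sh + 1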


The derivation is based on the fact that there is a universal enumerable semi-measure that is an optimal code for a discrete sample space. It is optimal in the sense that it must multiplicatively dominate all other constructive semi-measures. I.e. up to a multiplicative constant it assigns a higher probability to an outcome than any other computable distribution. In other words, if m(x) is the universal enumerable semi-measure then m(x) ≥ cµ(x), where µ(x) is any other semi-measure in the discrete sample space. The significance of this will become apparent, but firstly it should be noted that the universal property only holds for enumerable semi-measures, not for the class of all semi-measures or the more restricted class of recursive semi-measures (see Li and Vitanyi page 246 [79] and Hutter [67]). Gacs [58] has shown that m(x) is also constructive (i.e. computable from below).

While there are a countably infinite number of universal semi-measures, the universal halting probability Q(x) can be taken to be a reference universal semi-measure. All semi-measures can in principle be enumerated by establishing the halting probability QT(x) = ∑_{Ti(p)=x} 2^{-|p|} for every TM Ti. Let the universal halting probability Q(x) be defined for a universal machine that can simulate any TM. The set of all random inputs giving rise to Q(x) will include the simulation of every TM and every programme p that runs on each TM Ti. As Q(x) includes every semi-measure, it multiplicatively dominates them all. The universal halting probability is therefore a universal semi-measure. However, 2^{-H(x)} is also a constructive semi-measure, and because H(x) is equal to −log2Q(x) to within an additive constant, 2^{-H(x)} is equal to Q(x) to within a multiplicative constant. 2^{-H(x)} must therefore also be a universal semi-measure. In other words, the semi-measures m(x), Q(x) and 2^{-H(x)} are all universal, and are all equal to within a multiplicative constant as they all multiplicatively dominate each other. As a consequence, their logarithms additively dominate each other.

This gives rise to the algorithmic equivalent of the coding theorem [78]. The following equality holds to within an additive constant cp or an O(1) term:

H(x) = −log2m(x) = −log2Q(x)

This approach provides an alternative derivation of equation (3.24). For practical reasons it is convenient to take the universal semi-measure as m(x) = 2^{-H(x)}.

The Shannon coding theorem (3.18) shows that for a given distribution P(x), a binary self-delimiting code can be constructed that, on average, generates a code for string x that is close to −log2P(x) in length, and this is optimal for that probability distribution. However such a code is not universal for all probability distributions. In contrast, the algorithmic coding theorem shows that there is a universal code for any enumerable distribution (i.e. one that can be specified to a given number of significant figures). This code word for x is the shortest algorithm that generates x. Its length is H(x) or −log2Q(x) to within an additive constant cp independent of x.

The universal enumerable semi-measure m(x) is a measure of maximum ignorance about a situation. It assigns maximal probability to all objects (as it dominates other distributions up to a multiplicative constant) and is in effect a universal distribution. As a universal distribution, the universal semi-measure provides insight into defining a typical member of a set and provides a basis for induction based on Bayesian principles, as is outlined in the next section.

Inference and Bayesian statistics and the universal distribution

Laplace developed the concept known as the principle of indifference, or the principle of insufficient reason, from an idea first articulated by Jakob Bernoulli. This concept provides a tool to cope with situations where a particular outcome might have several possible causes. Because there is no reason to favour one cause over any other, the principle states that they all should be treated as equally probable. I.e. in the absence of any evidence that a tossed coin is biased, this principle would give equal probability to a head or a tail.²

The principle leads to Bayes' theorem, which determines how likely a particular hypothesis Hi is, given information D. I.e. the probability of the hypothesis given the data is

P (Hi|D) = P (D|Hi)P (Hi)/P (D). (3.27)

This rule, which is little more than a rewrite of the definition of conditional probability, provides a basis for inference. If Hi represents a member of the set of hypotheses or models that might explain the data D, then the inferred probability or posterior probability is defined as P(Hi|D); the probability of that particular hypothesis given the data. If the probability P(Hi|D) of one particular hypothesis Hi is significantly greater than the alternatives, there is some confidence that Hi is to be preferred as an hypothesis. I.e. after seeing data D, P(Hi|D) gives a measure of how consistent the data is with the hypothesis. However, the Bayes' approach would seem to be of little value unless P(Hi), the probability that Hi is the correct hypothesis, is known for all possibilities. I.e. one would need an accurate measure of P(Hi) for every possible hypothesis and ensure that ∑_i P(Hi) = 1.

Nevertheless, if there exists some estimate of the probability of a hypothesis P(Hi) in the absence of any data, that estimate can give a measure of how consistent the particular hypothesis is, given the data, by using Bayes' theorem. The principle of indifference mentioned above suggests all P(Hi) are the same, as all causes are assumed to be equally likely. However some information might favour one possible cause over another. The assumed value of P(Hi) is known as the prior probability or the initial probability, and the resultant probability, P(Hi|D), becomes the inferred probability given the data. Once an estimate of P(Hi|D) is established for the set of hypotheses, this information can be used as a new prior for a further estimate of P(Hi|D′) using new data D′.

² Of course this approach does not take into account the cost of a mistake in assuming no bias. While each cause can be assumed to be equally likely, one would need a betting or insurance strategy to cover hidden biases, as is discussed later.


If this iterative process is continued, P(Hi|D) will, depending on the evidence or data, generally favour one hypothesis over others. Thus this process of inference makes as much use of the available data as is possible. However there is a major difficulty that statisticians from a frequentist perspective have with this approach. In the frequentist framework a large sample of outcomes is needed to infer a probability measure. There is no place for assigning probabilities for one-off events, whether these be based on the principle of indifference or on estimates of likely causes. Despite this, Bayesian statistics is now commonly used, and Jaynes has encapsulated this principle in his idea of Maximum Entropy (section 7).

It emerges that, if one considers hypotheses that are finite sets of binary strings of finite lengths, the universal semi-measure m(Hi) = 2^{-H(Hi)} provides the best estimate of the prior probability P(Hi). In simple terms, where P(Hi) is not known, 2^{-H(Hi)} can be taken to be the prior probability as a best guess. Furthermore, Vitanyi and Li [109] argue that optimal compression is almost always the best strategy for identifying an hypothesis. The arguments above are taken further in section 7, discussing the minimum description length approach to model fitting.
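A minimal sketch of how such a prior is used in practice is given below. The hypotheses are assumed to be three biased-coin models, and the 'description lengths' assigned to them are illustrative stand-ins for H(Hi) (which in reality is uncomputable); the prior weight of each hypothesis is 2^{-H(Hi)}, renormalised, and Bayes' rule then updates these weights as coin-toss data arrives.

    # Bayes' rule with a description-length ("universal") prior, assumed toy models.
    # Each hypothesis is a coin bias; desc_len stands in for H(H_i), which is not computable.
    hypotheses = {
        "fair (p=0.5)":   {"bias": 0.5, "desc_len": 2},   # simple model, short description
        "biased (p=0.7)": {"bias": 0.7, "desc_len": 5},
        "biased (p=0.9)": {"bias": 0.9, "desc_len": 8},   # more specific, longer description
    }

    # Prior proportional to 2^-H(H_i), normalised so the weights sum to one.
    prior = {h: 2 ** -v["desc_len"] for h, v in hypotheses.items()}
    total = sum(prior.values())
    posterior = {h: w / total for h, w in prior.items()}

    data = [1, 1, 0, 1, 1, 1, 0, 1]   # observed tosses, 1 = heads

    for toss in data:
        for h, v in hypotheses.items():
            likelihood = v["bias"] if toss == 1 else 1 - v["bias"]
            posterior[h] *= likelihood
        norm = sum(posterior.values())
        posterior = {h: w / norm for h, w in posterior.items()}

    for h, w in posterior.items():
        print(h, round(w, 3))
    # The simplest hypothesis starts with most of the weight; the data gradually
    # shifts weight towards the bias that explains the observed tosses best.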

The universal semi-measure and inference

Originally, Solomonoff [97], in dealing with a similar inference problem, developed what was in effect the algorithmic approach. Solomonoff argued that a series of measurements, together with the experimental methodology, provided the set of data to feed into Bayes' rule in order to give the best estimate of the next data point. Solomonoff came up with the concept of the universal distribution to use as the Bayesian prior; the initial probability estimate. However Solomonoff's initial approach ran into problems, as he did not use self-delimiting codes, i.e. codes from a prefix-free set. Levin [78], by defining the universal distribution in terms of a self-delimiting algorithm, resolved that problem. As a consequence, the universal distribution of section 3.8 can be taken as the universal prior for a discrete sample space. As Levin has argued, the universal distribution gives the largest probability amongst all distributions, and it can be approximated from below. Effectively m(x) maximises ignorance, or encapsulates the principle of indifference, and so is the least biased approach to assigning priors. Gacs [58], using 2^{-H(x)} as the universal distribution (see section 8.2), has demonstrated the remarkable property that the randomness test for x given the universal semi-measure, namely d(x|m), shows all outcomes are random with respect to the universal distribution (see discussion in section 8.2). If the real distribution is µ(x), it follows that m(x) and µ(x) are sufficiently close to each other for m(x) to be used instead of µ(x). Hence m(x) can be used in hypothesis testing where the true distribution is unknown. In other words, the universal distribution as a prior is almost as good as the exact distribution. This justifies the use of the universal distribution as the prior to plug into Bayes' rule, provided that the true distribution over all hypotheses is computable. Effectively, as Li and Vitanyi [79] point out on p 323, this process satisfies both Occam's razor and Epicurus' principle.

Occam's razor states that entities should not be multiplied beyond necessity; effectively saying that simpler explanations are best. On the other hand, Epicurus had the view that if several theories are consistent with the data they should all be kept (Li and Vitanyi [79] page 315). Because any hypothesis with probability greater than zero is included in the universal distribution as a prior, and the approach gives greater weight to the simplest explanation, the approach satisfies both Occam and Epicurus.

Solomonoff's prime interest was in predicting the next member of a possibly infinite sequence (see Li and Vitanyi [79] for a reference to the underlying history). However infinite sequences require a modified approach, which will be schematically outlined here for completeness. Nevertheless, first time readers might find it convenient to ignore the following paragraphs.

In the infinite case, an output is not just x, but rather the output of all infinite sequences that start with x. This set of all such sequences starting with x is called the cylindrical set (see 2.3) and is denoted by Γx. In contrast to the discrete case, as a UTM that outputs the start sequence x with a particular programme may not eventually halt, possible non-halting programmes must also be included in the summation defining the semi-measure. As a consequence, the UTM must be a monotone machine; i.e. defined as a UTM that has one way input and output tapes (as well as work tapes). The universal semi-measure for the continuous case is given by

MU(x) = ∑_{U(p)=x∗} 2^{-|p|},                    (3.28)

where MU denotes that a monotone machine is to be used. Here x∗ indicates that all programmes that output a string starting with x are included. However, while the programme outputs x∗, it might not halt but continue outputting more bits indefinitely. This equation defines MU(x) as the probability that a sequence starting with x will be the output of a programme p generated by the toss of a coin. The continuous universal semi-measure M has similar but not identical properties to the discrete case. Furthermore,

−log2M(x) = H(x)−O(1) = KM (x)−O(1),

where KM(x) is the monotone algorithmic complexity, i.e. the length of the shortest programme run on a monotone machine that outputs a string starting with x. Suppose the start of a sequence is the string x and the characters in the sequence are distributed according to a computable measure µ. In which case, the probability µ that the next element is a y is

µ(y|x) = µ(x, y)/µ(x).

Page 51: ALGORITHMICINFORMATIONTHEORY ...algoinfo4u.com/AlgoInfo4u.pdf · Nevertheless the book pointed to Kolmogorov’s work on algorithmic complex-ity, probability theory and randomness

3.8. THE ALGORITHMIC CODING THEOREM, UNIVERSALSEMI-MEASURE, PRIORS AND INFERENCE 51

The estimate based on the continuous universal semi-measure M(x) is of the same form, namely

M(y|x) = M(x, y)/M(x).

This converges rapidly to the equation immediately above, and Solomonoff has shown [98] that after n predictions, the expected error from using M(x) instead of µ(x) falls faster than 1/n.
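Since M(x) itself is uncomputable, any concrete illustration has to substitute a computable stand-in. The sketch below, under that assumption, uses a small mixture of Bernoulli predictors weighted by 2 raised to the power of an assumed description length, and predicts the next bit via the ratio form M(y|x) = M(x, y)/M(x); it is a toy analogue of Solomonoff prediction, not the real thing.

    # Toy analogue of Solomonoff prediction: a mixture over a few computable
    # models, weighted like 2^-H(model).  The model set and lengths are assumptions;
    # the true universal semi-measure M(x) cannot be computed.
    models = [
        {"p_one": 0.5, "desc_len": 1},   # "fair coin" model, shortest description
        {"p_one": 0.9, "desc_len": 4},
        {"p_one": 0.1, "desc_len": 4},
    ]

    def M(bits):
        """Mixture 'semi-measure' of a finite prefix: sum of weight * model probability."""
        total = 0.0
        for m in models:
            prob = 1.0
            for b in bits:
                prob *= m["p_one"] if b == 1 else 1 - m["p_one"]
            total += 2 ** -m["desc_len"] * prob
        return total

    def predict_next(bits):
        """Probability the next bit is 1, via M(y|x) = M(x,y)/M(x)."""
        return M(bits + [1]) / M(bits)

    history = [1, 1, 1, 0, 1, 1, 1, 1]
    print(round(predict_next(history), 3))  # drifts towards the p=0.9 model's prediction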


Chapter 4

The algorithmic entropy of strings exhibiting order but with variation or noise

The algorithmic entropy of a configuration representing a microstate in the natural world is the length of the shortest algorithm that exactly defines the string specifying the microstate. However, as such a configuration may exhibit both structure and randomness, if the algorithmic description is to provide insights into natural systems, it must be able to consistently deal with different strings that each capture the same structure but also exhibit variations of the structure. Previously it was thought that the random components would dominate the length of the generating algorithm, obscuring any pattern [94]. However this is not so, as these strings, each exhibiting structure and variations of the structure, have a meaningful algorithmic entropy that aligns with the thermodynamic entropy for equilibrium states (see discussion in section 5.2). For example, a string representing the pixels that specify a noisy image on a viewing screen will have some pixels that capture the image while other pixels are just random noise. The exact description of a particular variation of the image relies on specifying the common structure and identifying the particular variation of the structure.

Exactly the same process can be applied to the algorithmic entropy of a particular variation of the microstate of the system. However, in the image case, all the configurations refer to different noisy variations of the same image, whereas in the thermodynamic case all the configurations will be determined by the essential characteristics of the thermodynamic macrostate in terms of the constants of motion and molecular species. The set of configurations within the macrostate, or the set of configurations specifying the noisy image, have features that are common to all, but with individual variations. It so happens that the algorithmic entropy of a typical string in a set of image strings, or a typical microstate in the thermodynamic case, is related to the Shannon entropy of the set of strings. As is outlined below, the algorithmic entropy consists of a term defining the structure of the set of microstates in the macrostate and a term defining the uncertainty. I.e. one needs only to identify the string with its pattern from the set of all similar strings. In this case the contribution of the noise or variation to the algorithmic entropy is equal to the Shannon entropy implied by the uncertainty in identifying a particular string. Three equivalent approaches not only resolve the problem of noisy strings, but also link the algorithmic approach more clearly into that of Shannon's Information Theory and statistical mechanics. These three formulations, which also align with Zurek's physical entropy [113] and the Gacs entropy [59, 60], are known as:

• the provisional entropy approach,

• the Algorithmic Minimum Sufficient Statistics approach, and

• the ideal form of the Minimum Description Length.

The first two are discussed below, while the third is discussed in more detail in section 7. Furthermore, Zurek's physical entropy [113], which is a mix of algorithmic entropy and Shannon entropy, returns the same value as the three equivalent formulations above, although it is conceptually different.

4.1 Provisional entropy

Devine [45] has shown that where a particular string (for example one representing the configuration of a real world system) exhibits both structure and randomness or variation of the structure, there is an implicit reference to the set containing all similar strings exhibiting the pattern. As this set will be recursively enumerable, all variations of the string that exhibit this pattern can be generated by an algorithm that consists of two routines. The first routine enumerates the set of strings S exhibiting the common pattern or structure and which contains the string of interest, while the second routine identifies the string of interest given the information that defines the set. The algorithmic measure of the particular string xi in the set will be termed "the provisional algorithmic entropy", denoted by Hprov(xi). Devine [45] used the phrase "revealed entropy" instead of "provisional entropy" to indicate the value depended on the observed or revealed pattern. The term "provisional entropy" seems preferable, as the observed pattern may not comprehensively account for all the structure in the string. The measure is provisional only. It is the best estimate of the algorithmic entropy given the available information, recognising that deeper, unobserved pattern might well exist. The two part description then becomes

Hprov(xi) = Halgo(S) +Halgo(xi|S) +O(1). (4.1)

Here Halgo(S) is the length of the shortest self-delimiting programme that describes the set S with its characteristic pattern, while Halgo(xi|S) is the length of the shortest self-delimiting programme that picks out xi in the set. The contribution of the second term to the algorithmic entropy is in practice the same as the Shannon entropy of the set. I.e. if there are ΩS elements in the set, and all members are equally likely, each string can be identified by a code of length log2ΩS. In this case, Halgo(xi|S) = log2ΩS, and the provisional entropy becomes

Hprov(xi) = Halgo(S) + log2ΩS +O(1). (4.2)

Or in general, denoting by Hsh the Shannon entropy of the set for the situation where different strings have different probabilities, the overwhelming majority of elements in the set will have the provisional entropy given by

Hprov(sj) ≃ Halgo(S) +Hsh (4.3)

Nevertheless, just as the Shannon entropy is a measure of the uncertainty in identifying each member of the set, the equivalent provisional entropy term requires the same number of bits as the Shannon measure to identify the particular member x of the set given the set's characteristics. In other words, the provisional entropy is the entropy of a typical member of the patterned set.

As was discussed in section 3.3, if the information required for self-delimiting coding can be built into the algorithm that defines the members of the set, it can then be fed into the algorithm that identifies the actual string in the set. In effect,

Hprov(xi) = Halgo(description of the set's structure)
            + Halgo(identifying the string in the set).           (4.4)

Even if all the structure of a typical string has been identified, there will be a few highly ordered non-typical strings that in theory might show more structure.

However there are implications for this in establishing the entropy of a physical system when the system is isolated. For example, what appears to be an ordered fluctuation of the electronic states will be associated with a disordered fluctuation elsewhere in the system, as energy is conserved. Another example is where the instantaneous configuration of the particles in a Boltzmann gas might be uniformly distributed over space, appearing as accidental ordering. However, as is discussed later, for real thermodynamic systems, in contrast to mathematical strings with no physical significance, when such an ordered fluctuation occurs in one part of the system, such as the position states, entropy must shift to the momentum states of the system in order to conserve energy. It is only when the disorder shifted to the momentum states raises the temperature that the disorder can pass to a colder environment. It is very difficult to be clear about what is meant by an ordered fluctuation in an isolated system. This is discussed further in Section 6.3.

As the upper measure of the algorithmic entropy, given the available information, is the provisional entropy, the provisional entropy is the best estimate of the actual entropy of a particular variation of a common structure, representing, for example, the microstate of a physical system. If all the structure is identified, the provisional algorithmic entropy will correspond to the true algorithmic entropy, resolving the problem raised in section 3.1. Nevertheless, as there is no certainty that all the structure has been identified, the minimum description satisfies H(xi) ≤ Hprov(xi). The name "provisional" indicates that unrecognised structure may exist. Nevertheless, the discussion shows the concerns about the algorithmic approach, raised in section 1.2, are unwarranted. While the algorithm exactly specifies a particular string, the variation or noise does not swamp the algorithm.

4.2 The Algorithmic Minimum Sufficient Statistic approach to the provisional entropy

As has been discussed, the provisional entropy is the algorithmic entropy of a typical string in a set of similar strings. Earlier, Kolmogorov introduced the algorithmic equivalent of Fisher's minimum sufficient statistic concept, known as the Kolmogorov Minimum Sufficient Statistic, or the Algorithmic Minimum Sufficient Statistic (AMSS). The AMSS for a string in a set of strings turns out to be identical to the provisional entropy. The approach, which is implicit in section 2.1.1 of Li and Vitanyi [79], seeks the smallest set in which the string of interest is a typical member. The length of the algorithm that identifies the string in the smallest set is the provisional entropy. However, with the AMSS or the provisional entropy, one can never be sure that the minimal set has been found.

Prior to the work of Devine [45] on the entropy of patterned noisy sets, Vereshchagin and Vitanyi [108] developed the AMSS approach to specify the shortest description of a string by optimizing both the length of the programme to generate the string and the description of the TM running the programme. It is a minimum statistic as, for a typical member of the set, all the information has been captured and no further pattern is detectable in the string. If the smallest set captures all the regularity in the string, the shortest algorithmic description of the set, together with the algorithm that identifies the string in the set, should correspond to its algorithmic entropy. This set could be the set of microstates in a thermodynamic macrostate, or the set of configurations that specify a noisy image. The approach optimises the combination of the TM Ti over all machines i where Ti(p) = x and the programme p. In which case (see section 3.13),

H(x) = min_{i,p} [H(Ti) + |p|] + O(1).           (4.5)

There are many TMs that can generate a string x. In order to describe a string with randomness and structure there is a trade-off between a TM that has a short description but which requires a longer programme |p|, and a machine with a short programme and a long description. The optimum occurs when all the pattern or structure is included in the description of the TM and only the variation, or random part, is captured by what is called in this context the programme p, but which in reality is a code to identify a particular string given the structure. Again the shortest description of a particular string involves finding the optimum set where the string is a typical member; i.e. relative to other members in the set it is random. As a TM is a function, it can be considered as a model that captures the essential structure or pattern in the observed string. In the language of sets, the model defines a set of strings exhibiting a common structure or pattern. All the structural information in the string is captured by the model embodied in the function Ti, while the programme p in Equation 4.5 identifies the particular string in the set. In the provisional entropy understanding, the programme that defines the set, with its characteristics, corresponds to the TM instructions that embody the model, while the remainder, the variation, is equivalent to the Shannon entropy given by a code that identifies the string of interest. If the description of the model is too sophisticated, it will overfit the data, and the reduction in the code length |p| is insufficient to compensate for the increase in the length of the model description. Alternatively, if the model is too simple, the size of the code length |p| will need to increase to provide the detailed corrections. For a given x, the optimum is found by (4.6) below, which is identical to equation 4.1. The algorithmic complexity of the string xi is [108]

H(xi) ≤ H(S) + H(xi|S) + O(1).           (4.6)

Similarly, from a provisional entropy point of view, H(xi) ≤ Hprov(xi), where Hprov(xi) equals the right hand side of equation 4.6. When the equals sign applies, H(xi) = HAMSS(xi) is the Algorithmic Minimum Sufficient Statistic. It is a minimum statistic as, for a typical member of the set, all the information has been captured and no further pattern is detectable in the string. As for the provisional entropy, if this set S is given, and ΩS denotes the number of equally probable members of the set, H(xi|S) ≤ log2ΩS + O(1). H(xi|S), the length of the description of xi given S, is the length of the algorithm that identifies the particular xi within the set. A typical random member of the set will be within a constant c of log2ΩS, i.e. H(xi|S) ≥ log2ΩS − c. Just as for the provisional entropy (4.8), the string is defined by a two part algorithm. In the AMSS case, the first part describes the structure of the optimal set, the simplest set in which the string is random with respect to other members. The second part identifies which string is the string of interest. This second term captures the level of uncertainty arising from the data. In general, the algorithmic sufficient statistic of a representative member of this optimal set S is

HAMSS(xi) = H(S) + log2ΩS +O(1).

Here, as before, H(S) represents the description of the set, and log2ΩS is the length of the algorithmic description of a typical member of the set and returns virtually the same value as the Shannon entropy measure of the uncertainty implied by the number of members of the set. While the main mathematical discussion in the literature uses the term AMSS to define the optimum set in order to define the shortest algorithm, here the term provisional entropy will be more commonly used as it links more easily into the entropy concepts of the natural sciences.


4.3 Practicalities of identifying the shortest AMSS description using the AMSS or the provisional entropy

Coding the jth string in the set

This section should probably be skipped by most readers. However I have written it as I needed to know how to make the provisional entropy concept work in practice. What does the algorithm that produces the provisional entropy look like? This section looks at the details of how to specify a particular string in a set of strings with the same basic structure to make the procedure more concrete.

In order to identify the string within the set of similar strings, the algorithmic code for the particular string must be able to be decoded. For simplicity, assume that all the Ωs strings in the set exhibiting the pattern have the same length. Each can be placed in lexicographic order and assigned a code depending on that order. If all strings occur with probability 1/Ωs, from equation (3.18), the minimal code length of the jth string will be |code(j)| = ⌈log2Ωs⌉. This length will be taken to be N. The practicalities of this are seen in the simple example where there are, say, 220 members of the set. In which case, log2 220 = 7.78, showing that all the strings in the set can be identified by an 8 digit code ranging between 00000000 for the first member and 11011011 for the 220th member (i.e. the binary of 220 − 1). In general, the jth sequence in the ordered list is coded by 00000...0 + the binary specification of (j − 1). While all such codes are self-delimiting with respect to each other, i.e. none is a prefix of any other, they are not self-delimiting with respect to other instructions and sub-routines in the generating algorithm. The simplest way to make the code for a particular string self-delimiting relative to other instructions is to pass information on the size of N, in the form of specifying H(N), the length of the code for N. I.e. for the 220 strings above N = 8, and this length is passed to the read instruction by the 3 bits needed to specify 8; H(8) = 3. The algorithm that generates a particular string will need to include the following.

• The code code(sj) for the jth string will have length N, which is close to log2Ωs. This code is the key input to the subroutine that identifies the string j, given the set characteristics.

• However, the algorithm must also have implicit or explicit information about the length of code(sj) to know when the code has been read. Either this length information H(|code(sj)|) = H(N) must be given to indicate when code(sj) has finished being read, or else the code is self-delimiting with length H(N) + |code(sj)|.

• The programme outlined below needs information on the length of the initial string 0000...00, i.e. the length N that is stored in the computer. This in effect provides the self-delimiting code for N, with H(N) = |N|. This information is available by including the length of the STORE routine in the form |STORE('0', N)| = |0| + |N| + O(1).

• A specification that defines the set of strings with a common structure can be taken to be the subroutine 'CRITERIA', with an output 'YES' if a string is a member of the set and 'NO' otherwise. The algorithmic entropy of the subroutine that defines the elements in the set is H(S) = |CRITERIA|.

The programme that generates the jth string takes the following form:

    INPUT H(N)              specifies the number of bits needed to read code(sj)
    INPUT code(sj)          H(N) bits long
    s = STORE(0, N)         creates a string of N zeros
    FOR I = 0 to code(sj)
      A. GOSUB CRITERIA     output = NO or YES
         NO:  s = s + 1
              GO TO A
         YES: s = s + 1
    NEXT I
    PRINT s - 1             needed as the loop has stepped too far          (4.7)

Step A specifies the criteria that define the set. This algorithm steps through all possible sequences s and counts all those that fit the criteria, i.e. it increments I until the jth patterned string is reached. As a consequence, the provisional algorithmic entropy of the string can be expressed as:

Hprov(sj) = |code(j)|+H(N) + |criteria defining patterned set|+O(1).

Here the O(1) terms include standard instructions like STORE, PRINT, GOSUB etc. If necessary, the zero of entropy can be chosen to take these as given. The length of the code, H(N), needed to self-delimit code(sj) is specified directly. However it could just as easily be part of the criteria defining the set. I.e. we could just step through all codes of every length and only accept those of length N. This has the advantage that the two part code now includes just the Shannon entropy and the set criteria, giving

Hprov ≃ log2Ωs + |criteria defining patterned set|+O(1).

In more formal terms, the algorithmic entropy of string sj can be expressed as;

Hprov(sj) = H(S) +H(sj |S), (4.8)

where the first term, H(S), is the length of the routine that characterises the set's pattern or structure, while the second term, which identifies sj by its position in the set, is the Shannon entropy. If the set of strings is defined by a model, the model's parameters provide the information to specify the set members. When log2Ωs >> H(S), i.e. Ωs is large relative to the definition of the set, Hprov ≃ Hsh. For the opposite and trivial case, when the set has only one or two members, the log2N term becomes the main entropy term, as is discussed in section 4.4.
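A minimal Python sketch of programme (4.7) is given below, under the assumption that the set criterion is supplied as an ordinary function; the particular criterion used here (exactly R ones in a string of length N) is only an illustrative stand-in for 'CRITERIA' and anticipates the simple example in the next subsection.

    # Minimal sketch of programme (4.7): recover the j-th member of a recursively
    # enumerable patterned set by stepping through strings in lexicographic order.
    def criteria(s, N, R):
        """YES/NO test for set membership: length-N strings containing exactly R ones."""
        return s.count("1") == R

    def generate(code_j, N, R):
        """Return the code_j-th string (counting from 0) that satisfies the criterion."""
        matches = -1
        for i in range(2 ** N):                      # step through all length-N strings
            s = format(i, "0{}b".format(N))          # lexicographic order
            if criteria(s, N, R):
                matches += 1
                if matches == code_j:
                    return s
        return None

    print(generate(0, 4, 2))   # '0011', the first string with two ones
    print(generate(5, 4, 2))   # '1100', the last of the 6 such strings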

A simple example

Devine [45] has used the above approach to determine the algorithmic entropy of a set of finite sequences where the 1's and 0's occur with a fixed probability. Such a sequence can be envisaged as representing the microstate of a physical system at a given temperature, as in a system of N two level atoms where only some atoms are in the excited state, or alternatively a system of N spins where some spins are aligned with a small magnetic field and others are not. In the first case a '0' indicates that an atom in the system is in the ground state and a '1' that it is in the excited state. The provisional entropy consists of the algorithm that determines the characteristics of the set and, given the set, identifies a particular string within the set [47]. For example, if the system is isolated, the number of 1's is a constant of the motion as energy is conserved. The microstate of the system can be specified by a string of length N with R ones and (N − R) zeros. These form the hypergeometric sequence (see Jaynes p 69 [18]) generated by drawing a finite number of zeros and ones from an urn until the urn is empty. If the urn has N objects made up of R ones, the set of sequences with different arrangements of R ones and (N − R) zeros has Ωs = N!/[(N − R)!R!] members. The provisional algorithmic entropy of the strings in the patterned set, which correspond to different microstates of the system, can be established by the method outlined in the previous section. Let COUNT(S) be the subroutine that counts the number of ones in the string S. If COUNT(S) ≠ R, the string S is not a member of the patterned set. The subroutine "CRITERIA" (4.7) becomes

    CRITERIA
      IF COUNT(S) ≠ R       (i.e. S does not have the desired number of 1's)
         NO, ELSE YES.                                                      (4.9)

With this membership criterion, and ignoring the O(log_2 log_2) terms for |R| and |N|, the relevant entropy for a microstate is: H_prov ≈ log_2 Ω_s + log_2 N + log_2 R. Effectively log_2 N bits are required to specify the length of the string and to set all cells to zero; log_2 R bits are required to specify the number of 1's, and the Shannon term log_2 Ω_s = log_2(N!/[(N − R)!R!]) bits are required to identify the particular configuration of the R ones. If all the atoms or spins were in the ground state, effectively at zero temperature, the algorithmic entropy would be ≈ log_2 N as the Shannon term would not be needed. The increase in algorithmic entropy due to the rise in temperature is log_2 Ω_s + log_2 R. The increase is due to an increase in uncertainty and the need to specify the pattern. However, for large R, given N, rather than specify R directly, it may be algorithmically

shorter to calculate R from R/N with a short algorithm when R/N is a simple ratio such as 3/4. When a short specification happens, it can be seen as a form of accidental ordering. However, as energy or bits must be conserved in a natural system, this can only happen if energy is lost or transferred to another part of the system. In other words, if the simple system described is a natural system, it cannot exist in complete isolation from the vibrational or momentum states which are essential to fully describe the system. This issue is dealt with in more detail in section 9.
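The entropy bookkeeping above can be checked numerically. The following is a minimal Python sketch with illustrative function names; it assumes the O(1) machine overheads and the O(log_2 log_2) self-delimiting terms are absorbed into the entropy zero, and simply evaluates the three dominant terms together with the membership test of (4.9).

    from math import comb, log2

    def in_patterned_set(s, R):
        # The CRITERIA test of (4.9): membership requires exactly R ones.
        return s.count('1') == R

    def provisional_entropy(N, R):
        # Dominant terms only: log2(Omega_s) + log2(N) + log2(R),
        # ignoring O(log2 log2) self-delimiting terms and O(1) overheads.
        omega_s = comb(N, R)          # number of strings with exactly R ones
        return log2(omega_s) + log2(N) + log2(R)

    # Example: 100 two-level atoms with 60 in the excited state.
    print(in_patterned_set('1' * 60 + '0' * 40, 60))   # True
    print(provisional_entropy(100, 60))                # about 93.5 + 6.6 + 5.9 bits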

4.4 The provisional entropy and the actual algorithmic entropy

There are a number of coding methodologies, such as entropy based coding, that can compress a patterned string to provide an upper measure of its algorithmic entropy. The question is how these methods are related to the provisional entropy or the equivalent AMSS approaches. The relationship is perhaps best shown by the simple example outlined below. But to recapitulate, a recursively enumerable set is one where a computable process exists to generate all the elements of the set, such as generating all the odd integers in order. Equivalently, a computable process exists for such a set when one can step through all possible strings and identify whether a particular string is a member of the set. The routine "CRITERIA" did this in the example discussed earlier. The example discussed here shows how one gets virtually the same result with either process. Furthermore, the results are compared with a process that enumerates the particular string of interest directly without going through the provisional entropy approach. The following illustrates how different processes for identifying short algorithms to define a string are likely to produce virtually the same algorithmic entropy because the major contributors are the same. While it is not always clear which is the shortest, it usually does not matter for a typical string. The example that illustrates this is the noisy period-2 binary string, consisting of the set of all sequences of length N of the form s_i = 1y1y1y . . . 1y1y . . . 1y (where y is randomly 0 or 1). The 2^(N/2) members of the set give log_2 Ω_S = N/2 and the set can be specified by a recursively enumerable process. The provisional entropy or AMSS approach would argue that the entropy is based on the length of the algorithm that defines the set of similar strings, plus a term that identifies the string within the set. The following different processes that specify the set of all such strings give effectively the same algorithmic entropy.

• One way to enumerate all the strings of the set of the noisy period-2 binary strings is to start with N/2 zeros and increment the string by one in turn to form the 2^(N/2) binary strings of length N/2 from 000 . . . 00 to 111 . . . 11. This generates all the variations of the string y, y, y . . . , y, y, where y is 0 or 1, i.e. the random component of the period-2 string. Each period-2 string itself can be produced from the random part by dovetailing a 1 between each of the binary characters in the length N/2 binary strings. The

required algorithm needs to specify N/2 zeros, which takes log_2(N/2) bits, the small stepping and dovetailing routine, and finally N/2 bits to specify the string of interest, the ith string in the set. Ignoring the stepping routine the provisional entropy is,

H_prov(s_i) ≃ N/2 + log_2(N/2) + |1| + |0|.

• Alternatively, if a checking algorithm is used to identify which strings are in the set and then specifies the ith string in the set, the algorithm that defines the set's members will be ≈ log_2(N/2), as it will need to read N/2 characters to determine whether the first member of each pair is a '1'. This is the requirement that defines the set. A further N/2 bits are required to identify the ith string in the 2^(N/2) strings in the set. The algorithmic entropy is the same as the first bullet point if the checking algorithm that specifies whether a string is in the set can be ignored.

• A third method is to find an algorithm that directly specifies the string. A particular string in this noisy period-2 sequence can be coded by replacing each '11' in the noisy period-2 string by a '1' and each '10' by a '0'. For example 11101011 . . . becomes 1001 . . . (see the coding sketch after this list). These codes are self-delimiting with respect to each other and the coded string is half the length of the original string. However this compressed coded string is, by itself, not an algorithmic measure, as the code does not generate the original string. A short decoding routine needs to be appended to the code to generate the original string. Additionally, the requirement that the codes must be self-delimiting relative to the decoding instructions, as well as with respect to each other, will add another ∼ log_2 N bits. In which case the relevant algorithmic entropy based on this coding is,

H(s_i) ≃ |code(s_i)| + |decoding routine| + O(1).

In this case, the decoding routine scans the coded string and replaces a '1' by a '11', and a '0' by a '10'. In order to be self-delimiting, the overall specification of the string s must contain information about the length of code(s). This can be done by either including it in the code, giving |code(s)| = N/2 + |N/2| etc., or including ∼ |N/2| in the decoding routine so that the routine knows when to finish reading the code. For large N, and ignoring the details of the O(1) decoding routine, the result is virtually identical to the previous approach. I.e.

H_prov(s_i) ≃ N/2 + log_2(N/2) + |1| + |0|. (4.10)
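The coding step in the third bullet point can be sketched in a few lines of Python. This is only an illustration of the '11' → '1', '10' → '0' mapping and its decoding routine; it ignores the self-delimiting bookkeeping discussed above.

    def encode_period2(s):
        # Replace each '11' by '1' and each '10' by '0'; the code is half as long.
        assert len(s) % 2 == 0 and all(s[i] == '1' for i in range(0, len(s), 2))
        return ''.join('1' if s[i + 1] == '1' else '0' for i in range(0, len(s), 2))

    def decode_period2(code):
        # The short decoding routine: '1' becomes '11' and '0' becomes '10'.
        return ''.join('11' if c == '1' else '10' for c in code)

    s = '11101011'                      # a noisy period-2 string with N = 8
    code = encode_period2(s)            # '1001', i.e. N/2 = 4 bits
    assert decode_period2(code) == s    # the decoding routine restores the string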

The point to note is that several algorithms can generate a particular string with the required characteristics. While it is not obvious which of these algorithms is the shortest, and therefore which gives the most accurate algorithmic entropy, the dominant terms are the same; the difference lies in relatively short algorithms that specify stepping, checking or decoding routines. Chaitin has

argued that there is a cluster of short descriptions of similar lengths that dominate all the algorithms that can generate a particular string.

All the above approaches assume there is no hidden structure. If further structure is identified, then the description of the set is not optimal. For example, in the set of noisy period-2 sequences there are strings such as the string s'_i = 101010 . . . 10 which are non-typical, as a shorter description exists. In this case the true algorithmic entropy will be less than the AMSS measure or the provisional entropy. In particular H_algo(s'_i) ≃ log_2 N, corresponding to "PRINT 10 N/2 times". In other words, for a non-typical member of the set, H_algo(s'_i) ≤ H_prov(s'_i). The first term in equation 4.10 becomes unnecessary, indicating that, as this string is not typical, the set or the model needs to be refined. For this reason the entropy measure is termed "provisional": as discussed in the previous section, more information may indicate the optimal set is smaller, leading to a lower algorithmic entropy. Nevertheless, in many physical situations, the overwhelming majority of members of a set will be typical.

Just to reiterate, as equation 4.4 indicates, the algorithmic entropy of most strings in a patterned set is given by the length of the algorithm that defines the set of all binary strings exhibiting the pattern, combined with the length of the algorithm that identifies the string of interest in the set. The AMSS or provisional entropy approach provides the shortest algorithm for the typical member of the set of similar strings. The typical member is random relative to the set of similar strings.

4.5 The specification in terms of a probability distribution

The provisional entropy and AMSS approach outlined above specified the algorithmic complexity in terms of the algorithm that described the optimal set containing the string of interest, together with an algorithm that specified the string in that set. This is appropriate where all members of the set are equally likely, as when a string represents a configuration of a real world system. For example, an isolated gaseous laser system might contain lasing atoms in an excited or ground state, incoherent and coherent photons, and momentum states capturing the motion of the particles. All the possible configurations appropriate to a given energy establish the set of possible states. As each of the possible configurations in the macrostate is equally likely, the actual state requires the set to be defined first and, given the set, the Shannon entropy term which identifies the particular configuration in the set.

However, for the situation where all strings are not equally likely and the set of strings can be defined by a probability distribution (such as that generated by a Bernoulli process), the characteristics of the set can be specified by the probability distribution of the members. However, if a set is to be defined in terms of a probability distribution, the distribution must be computable to an agreed level of precision. In this case Shannon's coding theorem can specify

each string in the distribution by a unique code based on the probability value for each data point in the distribution.

Let H(P) represent the length of the shortest programme that computes the discrete distribution on the reference UTM to an agreed level of precision. The algorithmic entropy of a particular string x_i that belongs to the distribution can now be specified by a two part code. The first part is the description that captures the structure of the set of possible strings, specified in terms of a probability P, while the second part identifies the actual string given the distribution. Hence

H(x_i) ≤ H(P) + H(x_i|P*). (4.11)

An equals sign in equation 4.11 holds for the overwhelming majority of strings in the distribution, but a few strings will have a shorter description. For example, in a Bernoulli process where 1's are emitted with a probability of p and 0's emitted with a probability 1 − p, the majority of strings of length n, consisting of a mix of r 1's and (n − r) 0's, will occur in the set with the probability p^r (1 − p)^(n−r). In which case an equals sign is appropriate. However, a few non-typical strings, such as those where 1's take up the first r places in the string, can be algorithmically compressed further than that implied by equation 4.11, and an alternative distribution or shorter direct algorithm is needed. The right hand side of this equation is also the provisional entropy or the Algorithmic Minimum Sufficient Statistic for the string.

According to Shannon's coding theorem (equation (3.18)), each member of the distribution can be optimally represented by a code such that |code(x_i)| = ⌈−log_2 P(x_i)⌉, where the ceiling function indicates the value is to be rounded up. Rather than continually using the ceiling notation, as is the custom this integer will be represented by −log_2 P(x_i), recognising that the code length must be an integer. With this in mind,

H(x_i) = −log_2 P(x_i) + H(P) + O(1), (4.12)

for a typical member of the set. Also, the O(1) includes any decoding routine if it is not explicit in the algorithm. ⟨H_algo(x_i)⟩, the expected value of H(x_i) given this distribution, is found by multiplying each term in equation (4.12) by the probability of x_i and summing over all states. As the overwhelming majority of contributions to the sum have the equals sign (as in (4.12)), and those that do not have very low probabilities, the expected code length is close to the Shannon entropy H_sh, as H(P) is small. (This is discussed in more detail in section 5.2.) In other words,

⟨H_algo(x_i)⟩ = H_sh(set of all states) + H(P) + O(1). (4.13)

The O(1) decoding routine is implicit in the algorithm that generates the string. Given the code for x_i, where |code(x_i)| = −log_2 P(x_i), the decoding procedure steps through all the strings with the same probabilities in order, until the x_i associated with the correct code is reached. Zurek [113] Equation 4.17 and

Bennett [10] Equation 2 use this argument to show that the expected value of the algorithmic entropy for a thermodynamic ensemble is virtually that of the Shannon entropy.

However, in practice, codes and strings must be ordered consistently. For example, those strings with the same number of zeros and ones in a binomial distribution will have the same probability. If codes and strings are ordered lexicographically for the same probability, the fifth string in lexicographic order is specified by the fifth code, in lexicographic order, associated with that probability. Shannon's coding theorem works because there is a discrete set of random variables and P(x_i) is the corresponding probability. If, as is discussed later, a probability density is used to specify noisy data, the probability needs to be computed to an agreed level of precision.
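The relationship between the probability based codes and the Shannon entropy can be checked numerically. The following Python sketch, for an assumed Bernoulli process, assigns each string the integer code length ⌈−log_2 P(x_i)⌉ and confirms that the expected code length lies within one bit of the Shannon entropy of the distribution; the routine that computes P itself, of length H(P), is not counted here.

    from itertools import product
    from math import ceil, log2

    p, n = 0.3, 10                  # Bernoulli probability of a '1', string length

    def prob(s):
        r = s.count('1')
        return p ** r * (1 - p) ** (n - r)

    strings = [''.join(bits) for bits in product('01', repeat=n)]
    shannon = -sum(prob(s) * log2(prob(s)) for s in strings)
    expected = sum(prob(s) * ceil(-log2(prob(s))) for s in strings)

    print(shannon)    # Shannon entropy of the distribution, about 8.8 bits
    print(expected)   # expected code length, within one bit of the Shannon entropy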

How to specify noisy data

The provisional entropy or the Algorithmic Minimum Sufficient Statistic approach showed how a two part algorithm can provide an algorithmically compressed specification of a particular string with recognisable structure but also with variation of the structure. The first part of the algorithm specified the set of all strings with the same pattern or structure and the second part, given the set, specified the string of interest.

Following this, section 4.5 extended the process to another two part algorithm where the first part specified the set in terms of an enumerable probability, and the second part specified the particular string by a probability based code. It should be noted that section 4.5 refers to a somewhat different situation than that discussed earlier in section 4.1. In section 4.1, the strings in the set showed variations of the same pattern or structure. These variations might be deemed random in the sense that there was no other observable structure. However, the issue in section 4.5 is where the string specifying a structure or an outcome is approximately specified by a model, but a computable probability is used to specify deviations from the model. This is the situation where, given the model, the probability of a fluctuation from the model's prediction needs to be specified.

The difference can be seen by defining the algorithm that specifies a particular configuration of 100 two-state atoms where the set of possibilities requires that 60 atoms are in the excited state and 40 in the ground state. Using 0 to represent the ground state and 1 to represent the excited state of each atom, the possible configurations all have the same number of bits. The number of variations is 100!/(60!40!) = Ω = 1.374 × 10^28. As log_2 Ω ≈ 94, each possible configuration can be expressed according to equation 4.4 by 94 bits, while further bits are needed to define the requirements of the 60:40 set. However, a different case is where a particular 60:40 configuration has, say, 60 ones followed by 40 zeros, but with noise imposed. Given that on average the string will exhibit the 60:40 ratio, how can the shortest string be expressed if there is a given probability p that any particular bit will change from 1 to 0 or vice versa? There is a finite probability of an extreme fluctuation where all the bits are

zero. In order to determine this extreme probability, let the probability that one of the 60 1's will be read as a zero be 0.01, while the probability that a 0 will remain a zero is 0.99. The probability of the above 60:40 string having a fluctuation where all bits are zero is 0.99^40 × 0.01^60. This probability is about 2^−400. The variation is coded by a probability code of length H(P(s)) = −log_2 2^−400 = 400 bits. As this fluctuation is extremely rare, its code length is large. However, the expected value of the length of this code over all possibilities is still the Shannon entropy. In general, the algorithmic entropy of this string is the code for this string and the specification of the 60:40 model as described in section 4.5, i.e. it is of the form H(s) = H(model) + H(P(s)), where the function H(P) specifies a particular variation with length H(P(s)) = −log_2 P(s).
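A quick numerical check of the extreme fluctuation is given below (a Python sketch; the flip probabilities are those assumed in the paragraph above).

    from math import log2

    p_flip_to_zero = 0.01   # probability that a '1' is read as a '0'
    p_zero_stays   = 0.99   # probability that a '0' remains a '0'

    # Probability of the extreme fluctuation: all 60 ones read as zero,
    # and all 40 zeros stay zero.
    p_all_zero = (p_zero_stays ** 40) * (p_flip_to_zero ** 60)

    print(p_all_zero)          # about 6.7e-121, i.e. roughly 2**-400
    print(-log2(p_all_zero))   # about 399 bits: the length of the probability based code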

The practicalities of the probability approach are outlined in the next section. However, where there are noisy variations of each microstate in a set, a combined approach may be necessary to identify each core string with a specified pattern and the noise imposed on each.

Model examples

An example of the deviation from the model in two dimensional space might be the straight line model that optimally fits a noisy set of data described by the equation y_i = ax_i + b + noise = y(x_i) + noise. The deviation or noise is y_i − y(x_i) and occurs with a probability distribution P(y_i − y(x_i)). This distribution might, for example, be the normal distribution. In order to specify a particular string in the set of strings using an algorithm, one needs to determine the noisy deviation knowing the probability P(y_i − y(x_i)) for that deviation. From Shannon's Coding Theorem, the deviation can be optimally coded by |code(y_i − y(x_i))| = −log_2 P(y_i − y(x_i)), i.e. given this code, the deviation from the model can be generated from its code and the algorithm that specifies the probability, as outlined in section 4.5.

In the straight line example outlined, the provisional entropy for a particular string y_i given x_i is the length of the algorithm that defines the string from its equation and the associated probability distribution, plus a code that specifies the deviation of that string from the equation. The result is

H_prov(y_i) = H[ax_i + b, P] − log_2 P[y_i − y(x_i)] + |decoding routine| + O(1),

or

H_prov(y_i) = H[ax_i + b, P] + |code(y_i − y(x_i))| + |decoding routine| + O(1).

Here the decoding routine is specifically identified, and the O(1) terms allow for short codes to join subroutines. The first term captures both the shortest description of the model and its associated enumerable probability distribution, while the second term specifies the deviations from the exact model, following the arguments of section 4.5. The expectation value for the second term is the Shannon entropy of the deviations.

Page 67: ALGORITHMICINFORMATIONTHEORY ...algoinfo4u.com/AlgoInfo4u.pdf · Nevertheless the book pointed to Kolmogorov’s work on algorithmic complex-ity, probability theory and randomness

4.5. THE SPECIFICATION IN TERMS OF A PROBABILITYDISTRIBUTION 67

This approach is effectively selecting the Algorithmic Minimum Sufficient Statistic (AMSS) for the string y_i, or the best estimate of the provisional entropy. It is closely related to the Minimum Description Length (MDL) approach discussed later in section 7, as the ideal form of MDL gives the same model for noisy data. The algorithm that generates point y_i given point x_i contains the probability based code, a decoding routine, the probability distribution needed to calculate the deviation from the model, and finally the specification of the model. The algorithm undertakes the following computational steps (a sketch in code follows the list):

• Takes the code and, knowing that |code(y_i − y(x_i))| = −log_2 P(y_i − y(x_i)), generates the value of the probability for that string;

• Calculates y_i − y(x_i), the deviation due to the noise, from the probability value using the routine for generating P;

• Generates y(x_i) from ax_i + b;

• Adds the noise y_i − y(x_i) obtained in the second bullet point to y(x_i) and generates the specific y_i as required.
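The pipeline in the list above can be sketched as follows. This is only an illustrative Python toy under several assumptions: the deviations are discretised onto a grid, the Gaussian noise distribution and the parameters a, b, sigma and the grid step are invented for the example, and the "code" is simplified to the rank of the deviation in a probability ordering rather than an actual prefix-free bit string of length −log_2 P.

    from math import erf, sqrt, log2, ceil

    a, b, sigma, delta = 2.0, 1.0, 0.5, 0.1   # model parameters, noise scale, grid step

    def model(x):
        return a * x + b

    def dev_prob(d):
        # Probability of a deviation falling in the bin [d - delta/2, d + delta/2]
        # of a Gaussian with standard deviation sigma, i.e. the discretised P.
        z1 = (d - delta / 2) / (sigma * sqrt(2))
        z2 = (d + delta / 2) / (sigma * sqrt(2))
        return 0.5 * (erf(z2) - erf(z1))

    # Admissible deviations, ordered from most to least probable.
    grid = sorted((round(k * delta, 10) for k in range(-50, 51)),
                  key=lambda d: -dev_prob(d))

    def encode(x, y):
        # The deviation from the model, rounded to the grid, identified by its rank.
        d = round(round((y - model(x)) / delta) * delta, 10)
        return grid.index(d)

    def decode(x, rank):
        # Step through the deviations in order until the coded one is reached,
        # then add it to the model prediction y(x).
        return model(x) + grid[rank]

    x, y = 3.0, 7.3                           # observed point; the model predicts 7.0
    rank = encode(x, y)
    print(ceil(-log2(dev_prob(grid[rank]))))  # code length for this deviation (~4 bits)
    print(decode(x, rank))                    # reconstructs y to grid precision: 7.3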

A more general example

Instead of listing the earth's position for every second of a year, Newton's laws can capture the motion in a set of equations. If the equations exactly described the earth's motion around the sun, the equations would represent an extremely compressed version of the data. However, the capture is not exact, as variations occur. The variations from the model might be due to measurement error, perturbations from the moon, or a breakdown in the assumptions of simple spherical bodies etc. Where the deviations appear as noise they can be represented by a probability distribution. In this situation the observational data can still be compressed by describing the data algorithmically - initially in terms of Newton's laws (which specify the particular string should Newton's laws exactly predict the data point) and then by coding how the observed data point deviates from the simple predictions of Newton's laws. If later a further cause or pattern in the variation is uncovered, as might occur by taking into account the moon's effect, the compression can be improved.

A good model is one that has maximally compressed the data. All the regularity is in the model, while the variation (which may be due to the limits of the model, noise, or experimental error) will appear as random deviations from the model. This again is the argument used to find the Algorithmic Minimum Sufficient Statistic for the data discussed in section 4.2, but where the model is captured in physical laws, and the randomness is captured by the code that identifies the particular string given the laws.

As is discussed in the next section, the model that best fits a series of data points is the one given by the shortest algorithm that defines the model and the deviations from it needed to exactly fit each data point y_i.

Model selection

The previous two sections show how to specify a string representing a data point given an understanding of the laws which generate the string and the deviation from the laws. Vereshchagin and Vitanyi [108] apply the ideas to provide a comprehensive approach to selecting the best model that describes a particular string or strings based on the AMSS approach of Kolmogorov. As Vereshchagin and Vitanyi [108] point out, "the main task for statistical inference and learning theory is to distil the meaningful information in the data".

In a nutshell, given a set of data, the best model of that data is the one that captures all the useful information. In general, a model based on natural laws provides the best compression of the data at hand. It follows that one can try explaining the data with different models and select the one with the minimal description as the best explanation. Hence for each credible model of the data, a two part algorithm can be used to define each string. The first part uses the model and the second part codes the deviation from the model. The model with the shortest overall description is the preferred model. The algorithm describing a data string using the optimum model for that string is the Algorithmic Minimum Sufficient Statistic, as all the useful information is contained in the algorithm. As Vereshchagin and Vitanyi [108] explain: "For every data item, and every complexity level, minimizing a two part code, one part model description and one part data-to-model, over the class of models of at most given complexity, with certainty selects models that in a rigorous sense are the best explanations of the data". The two part code that provides the shortest algorithmic description of the model, plus the deviations from the model, is preferred. In other words, it is a short step from finding the shortest description of a string to finding the best model that models a set of strings.

With this in mind, the algorithm that generates a typical string in a set of strings defined by a model and the deviations from the model requires the following components.

• The description of the model, which in the straight line model described above requires 2d bits to specify a and b to a precision d.

• The description of the deviation from the model. I.e. given an input x_i, the deviation is y_i − y(x_i), where y(x_i) is the value predicted by a perfect model. The deviation y_i − y(x_i) is distributed according to a discrete probability model. According to Shannon's Coding Theorem, the deviation can be optimally coded by a code of length −log_2 P[y_i − y(x_i)]. For a Gaussian distribution where each x_i is specified to a given level of precision, leading to a discrete probability distribution, the log probability term depends on (y_i − y(x_i))^2/(2σ^2) and log_2(σ√(2π)), noting the standard deviation σ is determined from the data. If the outcome was not Gaussian or another symmetrical probability distribution, the implication is that hidden structure exists and the deviations would not be random. In which case, a more refined model would be needed.

It should be noted that the parameters a and b, and the characteristics of the distribution, are not determined by the model selection process. The parameters must arise from other processes that determine the best model. In the simple straight line case they might be determined by a fitting procedure; for example, one that minimises the sum of the squared deviations. Consider the situation where it is suspected that the straight line model is the best explanation of a set of data. As is discussed later in section 7, if a 2nd degree polynomial provides a better fit than the straight line, this would be because the increase in the model description to three parameters is offset by a reduction in the deviations from the model. In the 3 parameter case an extra d bits are required to describe the model (3d bits in all) and this must be offset by a decrease in the deviations represented by the log probability term, i.e. (y_i − y(x_i))^2/(2σ^2). The limitations of the algorithmic approach can now be seen. The best parameters for each model must first be determined before the models can be compared. However, once the best model is found by whatever means, the provisional algorithmic entropy based on the model is the best estimate of the entropy of each string.
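The comparison between a straight line and a 2nd degree polynomial can be illustrated with a rough Python sketch. The choices below are assumptions made for the example only: the parameter precision d is set to 16 bits, numpy's least-squares polyfit stands in for whatever fitting procedure supplies the parameters, sigma is taken from the residuals, and the constant due to discretising the Gaussian is dropped because it is common to both models.

    import numpy as np

    def description_length(x, y, degree, d_bits=16):
        # Two part code: (degree + 1) parameters at d_bits each, plus the sum of
        # -log2 P(deviation) for a Gaussian whose sigma is estimated from the residuals.
        coeffs = np.polyfit(x, y, degree)
        resid = y - np.polyval(coeffs, x)
        sigma = resid.std(ddof=degree + 1)
        data_to_model = np.sum(resid ** 2 / (2 * sigma ** 2) * np.log2(np.e)
                               + np.log2(sigma * np.sqrt(2 * np.pi)))
        return (degree + 1) * d_bits + data_to_model

    rng = np.random.default_rng(1)
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)   # data generated by a straight line

    print(description_length(x, y, 1))   # straight line: the shorter total description
    print(description_length(x, y, 2))   # quadratic: the extra d bits are not repaid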

The model selection approach is identical to the ideal Minimum Description Length approach discussed later in section 7. It is only valid if the string of interest, or the strings of interest, are typical strings in the two part description. They must be random in the set with no distinguishing features. If this is not so, as was discussed in section 4.5, another probability distribution may be better.

One further point: the model obtained by minimising the two part description is the best explanation of the data; however, it is not necessarily the correct explanation.


Chapter 5

Order and entropy

5.1 The meaning of order

In principle it is possible to extract work from two physical systems, or two identifiable parts of one physical system, provided one is more ordered than the other. In doing so, the total disorder of the universe increases. However, in order to extract work, the observer must be able to recognise the order and its nature so that it can be usefully harnessed. A set of ordered states is a subset of all the possible ordered and disordered states of a system. In general it is only where members of a subset exhibit a common pattern that the order can be recognised and described. To an alien, one string of human DNA would show no obvious pattern, as the chemical sequence would be seen as just one of the myriad possibilities. While an intelligent alien might have the capability to undertake a randomness test (section 8.2) to determine whether the discovered DNA string does show order, this is unlikely to work as the DNA string may be maximally compressed. The surest way for the alien to identify a pattern is by observing the set of DNA from a variety of mammals, noting there is much in common.

In general, as section 8.2 argues, the degree of order, or the lack of randomness, is difficult to quantify. Nevertheless, an ordered sequence can be recognised as exhibiting a pattern because our minds, like that of the alien, perceive that the sequence belongs to a subset of similar patterned sequences. Our minds see this subset as distinct from the set of apparently ordinary or random strings. If the pattern is recognised, any ordered sequence s of the subset can be compressed, i.e. be described by an algorithm much shorter than the sequence length. Thus any common pattern implies that an algorithm s* exists which can specify the patterned string and for which |s*| << |s|. Once such a pattern is determined empirically by observation or some trial process, the fact that no computational procedure exists to determine the pattern becomes irrelevant. If it exhibits a pattern it can at least be partially compressed (see the comments of [95] and [41]).

A simple example illustrates the argument. Consider the set of all strings

of length N of the form 1y1y1y . . . 1y1y where y is randomly either '0' or '1'. As the set has a recognisable pattern, a sequence from the set is able to be compressed into a shorter description. This contrasts with a general sequence of length N, where no order is discernible and there is no certainty that any order exists. As the typical member s of the full set will not be able to be compressed, in general |s*| ≈ the length of the string. In more formal terms, any patterned set is a subset of all possible strings.

Devine (2006) argues that the use of the AIT framework to specify the algorithmic complexity of such patterned strings has some advantages. Also, it would appear to be applicable, at least in principle, to systems that are not just simple patterns with variation or noise. Indeed, if AIT concepts can be extended, they offer some interesting potential applications.

For example, a particular plant at an instant of time is specified by an algorithm based on its DNA, and the instruction set encoding the physical and biochemical laws (as is discussed in more detail in section 6.3). The inputs to the algorithm are the accessible nutrients and the impacts from the external environment that determine the plant's growth trajectory. The algorithmic entropy describing the plant's configuration at an instant of time is generated by an algorithm that halts at that instant. As time flows the same algorithm, halting at a later time, generates a more biologically complex pattern. The algorithm that describes the plant growth is copied in each cell. The inputs to each cellular algorithm are the chemical outputs of the neighbouring cells, neighbouring structures and the external environment mediated by neighbouring cells.

A biological cluster of cells would appear to be something like an attractor in a state space appropriate to an extremely complicated non-linear system, as section 9.4 outlines. The components of this cluster maintain homeostasis as the whole is maintained in a stable region by balancing interactions between the cells. Where homeostasis cannot be maintained, the algorithm terminates, causing death; perhaps because of an external input shock; perhaps because the algorithm gets corrupted by repeated copying; or perhaps because resource or physical limits constrain the continued reproduction of the cellular algorithm.

The set of all cells in a plant or animal has an underlying pattern and a variation. For example, while a cell in an eye is different from one in a muscle, it is specified by the same DNA and differs only because of different environmental inputs. In principle, two algorithmic approaches to specifying different cell types are possible. The first might specify different cell types by a string from the set containing the different variations of the basic cell pattern. The second might instead specify the algorithm that generates the cells from the seed DNA, but which also includes information on the mechanism that determines the genes that are expressed for the different cell types. In section 6.3 it is argued that, once everything is taken into account, the shortest algorithm is the one that generates a plant from the seed. Nevertheless, whichever approach gives the shorter algorithm, AIT can be used to investigate any biologically complex system (i.e. an organised complex system) at different levels of scale and, at each nested level, only the relevant patterned set needs to be explained. Such

an approach would be consistent with the top down approach of Beer's Viable Systems Model ([8, 7]), which is a framework for probing organised systems in terms of nested subsystems. This approach is expanded in section 9 and tied into Chaitin's concept of d-diameter complexity [26].

5.2 Algorithmic Entropy and conventional Entropy

A microstate, which corresponds to a particular configuration in the state space of a thermodynamic ensemble, can be expressed as a string of binary characters. Here the state space refers to what is called the phase space, the space that specifies the position and momentum of a particle, and other states such as electronic states which, in contrast to the phase space coordinates, may be discrete. Where the specification of, say, the position or the momentum of a particle is not discrete, it is necessary to specify the coordinates to an agreed degree of precision in binary notation. One way to do this is to start with particle 1 and specify its position coordinate along the x axis to the agreed degree of precision, followed by the same process for the y and the z axes. The momentum coordinates of each particle are similarly expressed in turn. The discrete electronic state of the particle can also be expressed by a binary string. Particle 1's position and momentum coordinates are defined in a 6 dimensional phase space. The space which specifies the positions for all N particles is known as Γ space and is made up of 6N dimensions. The string specifying a configuration s_i looks like:

s_i = x_1, y_1, z_1, x_2, y_2, z_2, . . . , x_n, y_n, z_n, p_x1, p_y1, p_z1, p_x2, p_y2, p_z2, . . . , p_zn

Here x_1 is the x coordinate of atom 1 etc., while p_x1 is its momentum coordinate in the x direction. The agreed degree of precision for the specification of coordinates is in effect the resolution that imposes a grid or cellular structure on the space. Doubling the resolution, by doubling the number of bits specifying each coordinate, increases the number of cells in the Γ space by 2^(6N). Zurek uses an alternative approach to define the microstate as a string [113]. He divides the 6N dimensional Γ space into cells and then argues that the cell size is chosen to be fine enough so that a cell either contains one particle or no particles. The configuration of a particular microstate can be specified by systematically listing the cells in order and denoting an empty cell by a 0 and an occupied cell by a 1. It makes no real difference which way a microstate is specified provided a simple short routine exists to convert one specification to another. This can be done for the two approaches just outlined.
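A minimal sketch of the first encoding, in Python, is given below. The box size, momentum range and the 8 bit precision are arbitrary choices for the illustration; the point is only that each coordinate becomes a fixed-length cell index and the concatenation is the string s_i.

    def to_bits(value, lo, hi, bits):
        # Cell index of `value` on a grid of 2**bits cells spanning [lo, hi),
        # written as a fixed-length binary string.
        cell = int((value - lo) / (hi - lo) * 2 ** bits)
        return format(min(cell, 2 ** bits - 1), f'0{bits}b')

    def microstate_string(positions, momenta, lo, hi, plo, phi, bits=8):
        # Position coordinates of every particle, then the momentum coordinates,
        # each to `bits` binary digits, concatenated into the string s_i.
        pos = ''.join(to_bits(c, lo, hi, bits) for xyz in positions for c in xyz)
        mom = ''.join(to_bits(c, plo, phi, bits) for pxyz in momenta for c in pxyz)
        return pos + mom

    # Two particles in a unit box with momenta in [-1, 1): a 2 * 6 * 8 = 96 bit string.
    s_i = microstate_string([(0.1, 0.5, 0.9), (0.3, 0.3, 0.7)],
                            [(0.2, -0.4, 0.0), (0.9, 0.1, -0.8)],
                            0.0, 1.0, -1.0, 1.0)
    print(len(s_i), s_i[:24])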

Where there is no shorter algorithmic description of the microstate, the algorithmic entropy is a little more than the length of the description. On the other hand, the algorithmic description of an ordered state will be shorter, as the description can be compressed. Effectively the cellular resolution or grid resolution can be chosen to be physically realistic within certain limits. Strictly, the algorithmic description of the microstate in the classical system requires a description of the grid structure and the specification of the microstate within

this grid structure. However, this is part of the given information, like the physical laws, and cancels for entropy differences. For a quantum system, the resolution of the grid of cells is determined by the uncertainty principle. This can always be taken to be the limiting resolution requirement for the classical case.

The point to note is that the algorithmic approach, like the Boltzmann and Gibbs measures of entropy, must also assume an underlying grid structure. Zurek [113] shows that the ambiguity in the Gibbs entropy, arising from the choice of coarse graining of the phase space, can be absorbed into the background algorithmic description of the system. In other words, the way the phase space or other continuous coordinates are divided into cells is encapsulated in the message about the system. Zurek [113] points out that the overall description must be minimal given a particular UTM. Fanciful grid structures, which complicate the coarse graining concerns of the Gibbs entropy, require long descriptions and become ineligible. Effectively, as Zurek points out, in the algorithmic approach the complications of the choice of the grid structure are absorbed into the choice of the UTM. In practical terms, because the critical physical parameter is the entropy difference between states, issues to do with the UTM used, or the message about the system (i.e. the common information and the resolution), only shift the algorithmic entropy zero and therefore can be ignored.

Once the specification of the microstate is defined, the length of the shortest self-delimiting algorithm that computes the microstate in the state space coordinates is the algorithmic entropy. It is argued below that the algorithmic entropy of a typical or equilibrium state has a formal similarity to the Shannon entropy, H_sh(E) = −Σ_i p_i log_2 p_i, of the ensemble E. Nevertheless, there is a conceptual difference between the algorithmic entropy and the Shannon entropy. The algorithmic entropy exactly defines a given microstate whereas the Shannon entropy is a measure of the uncertainty in an ensemble of microstates.

There are a number of arguments that illuminate the relationship between the Shannon entropy and the algorithmic entropy, as outlined in the following bullet points. The first is more of a "hand waving" argument that provides insights. The second is more rigorous, whereas the third is a detailed algorithmic entropy calculation that shows the algorithmic entropy for a Boltzmann gas is the same as the entropy given by the Sackur-Tetrode equation, while the fourth shows that for most microstates, the dominant term of the provisional entropy is the Shannon entropy.

• Let the state space coordinate of the ith microstate of a thermodynamic system be specified as a string s_i of length N. Also let s*_i be the minimal programme that generates string s_i, where s_i occurs with probability p_i in the ensemble. If the expected value of the algorithmic entropy of string s_i in the ensemble is denoted by ⟨|s*_i|⟩, the first argument recognises that the expected value over all strings s_i is ⟨|s*_i|⟩ = Σ_i p_i |s*_i|. This sum can be rearranged in terms of lengths of algorithmic descriptions so that ⟨|s*_i|⟩ = Σ_l p_l l, where p_l is the probability of an algorithmic description of length l. In a thermodynamic ensemble, the overwhelming majority of possible states belong to an equilibrium set of configurations. These cannot be compressed significantly. If there are 2^N configurations in the ensemble, most will be expressed by an algorithm of length log_2(2^N). In which case the overwhelming majority of strings in the sum will have a length l close to N. As Σ_l p_l l becomes ≈ N Σ_l p_l and Σ_l p_l = 1, the result for the expected algorithmic entropy of a string is approximately N. In other words, the algorithmic entropy of most strings will be the same value as the Shannon entropy of the ensemble (a small counting sketch follows this list). The Shannon entropy itself is related to the entropy of statistical mechanics S by S = k_B ln2 H_sh(E), which in turn is related to the macroscopic thermodynamic entropy.

• The second argument, following Zurek [113] and Bennett [10] (see also [79]), uses the approach of section 4.5 to highlight the relationship between the algorithmic entropy and the Shannon entropy. Let P(x) be the computable probability of all the states in a thermodynamic ensemble with the Shannon entropy taken as H_sh(E). Equation (4.13) shows the algorithmic entropy is no more than H(P) greater than the Shannon entropy. But because Shannon's noiseless coding theorem 3.3 shows that l_i, the code length for each string, can be made to be within one bit of −log_2 p_i, the expected value of this length, i.e. ⟨H_algo(x_i)⟩, will at worst be slightly greater than H_sh(E); hence H_sh(E) ≤ ⟨H_algo(x_i)⟩. As Bennett [10] argues, the length of the algorithm that calculates the probability distribution, given by H(P), is only a few thousand bits, whereas the Shannon entropy is typically extremely large. If the position and momentum coordinates are defined to 8 significant binary digits, this requires 16 bits for each of the 10^23 particles in a mole of gas. Therefore, as H(P) is negligible in comparison, it can be seen that ⟨H_algo(x_i)⟩, the expected entropy of the microstate, is within O(1) of the Shannon entropy of the ensemble. For a system in equilibrium, or an isolated system with many states, the typical algorithmic entropy is virtually the same as the Shannon entropy.

• The third argument, developed by Zurek, shows that the algorithmic entropy of a particular equilibrium state in an ideal Boltzmann gas of N indistinguishable particles is identical to the entropy provided by the Sackur-Tetrode equation [113]. A simple argument to demonstrate this is to give the address of a particular configuration among the Ω possible configurations. A typical value of the address is ∼ Ω. This requires a description of length about log_2 Ω bits. Any configuration is a sequence of 0's and 1's. If there are C cells in the phase space, there will be C^N/N! distinct configurations. As the number of cells is given by C = (V/ΔV)(√(m k_B T)/Δp)^3, the algorithmic specification leads to the Sackur-Tetrode equation. Here m is the particle mass, ΔV is the 3 dimensional cell volume, and Δp is the cell volume in momentum space.

This argument shows that the algorithmic entropy of a typical state is the same as the entropy of statistical mechanics.

• Section 4, and in particular (4.2), shows that the provisional entropy of a string is given by a two part code. The first part defines the structure of the set containing the string and the second identifies the particular string within the set. As argued in the second bullet point above, the latter term is in effect the Shannon entropy of the set of strings. For a thermodynamic macrostate, where the specification of the set is a short algorithm, the major contributor to the provisional entropy (the best guess of the algorithmic entropy) is the Shannon entropy of the set of strings in the macrostate.
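The counting behind the first bullet point can be made concrete with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the standard counting bound that fewer than 2^(N−k) programmes are shorter than N − k bits, so at most a fraction 2^−k of the 2^N equilibrium configurations can be compressed by k or more bits.

    N = 100   # the 2**N equilibrium configurations are described by strings of length N

    # Fraction of configurations compressible by k or more bits is below 2**-k.
    for k in (1, 5, 10, 20):
        print(k, 2.0 ** -k)

    # Lower bound on the expected description length: at least a fraction (1 - 2**-k)
    # of the strings need N - k bits or more, for any choice of k.
    bound = max((1 - 2.0 ** -k) * (N - k) for k in range(1, N))
    print(bound)   # about 92 bits, i.e. close to N = log2(2**N), the Shannon entropy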

The algorithmic approach can be used to argue that, over time, an isolated system will tend to an equilibrium configuration corresponding to one of the set of most probable states. As is discussed later in section 6.3, Zurek [113] shows that, where an isolated system evolves under physical laws, after a large number of steps, but small relative to the Poincare repeat period P, the trajectory will settle for most of the time in the equilibrium region where the states are algorithmically random and a typical description has H_algo(s_k) = |s*_k| ≈ log_2 P = H_sh(E). This is discussed further in section 6.3.

Boltzmann, Gibbs, Shannon and Algorithmic entropy

The Boltzmann, the Gibbs and the Shannon entropies are in effect measures of ignorance about the whole system as, in contrast to the algorithmic entropy, they say nothing about the particular state of the system. This section further discusses the relationship between the Boltzmann, Gibbs and Shannon entropies and the algorithmic entropy. The difference between the algorithmic approach and the statistical mechanics approach is first discussed. Following this discussion, Zurek's concept of physical entropy is outlined in section 5.2. Zurek's entropy measure is a combination of the algorithmic and Shannon entropy that captures both what is known, and what is uncertain, about the microstate of a physical system. Zurek uses this approach to provide a robust understanding of reversibility and work extraction from such a system. However, it is now clear that there is no need for a special definition of Zurek's physical entropy, as it corresponds to the provisional entropy discussed in section 4 [45, 47], or equivalently Gacs' alternative entropy [60], or the Algorithmic Minimum Sufficient Statistic outlined by Vereshchagin and Vitanyi [108]. Despite their different labels they are just the best guess algorithmic entropy based on the available information. As a consequence, Zurek's insight into the requirements to extract work from a physical system applies unchanged to the provisional algorithmic entropy (or its equivalents) without the need to define a physical entropy. Furthermore, this whole approach would suggest that the algorithmic entropy is a more fundamental concept than either the Boltzmann or Gibbs entropies (or the Shannon entropy). Indeed, these entropies, allowing for units, can be seen as the limiting case of an algorithmic description

when there is no information about the exact state, but only about the set of possible states. The remainder of this section discusses these points in more detail. However, just to reiterate, the algorithmic entropy measure is a measure of the microstate of the system; it is the shortest description of an actual state. Nevertheless, the expected value of the algorithmic entropy over the whole universe of states is virtually the same as the Shannon entropy and therefore, apart from the units used, is equivalent to the entropy of statistical mechanics. The alignment is not exact as, in contrast to the other entropies, the algorithmic approach must specify the conditions of the system under study. These conditions are understood in the other cases. Nevertheless, where these conditions are common to the specification of the system as a whole, they only shift the entropy zero and have no bearing on entropy differences.

The Boltzmann approach considers a gas of N particles in a container. Boltzmann specified the instantaneous macrostate of the gaseous system by the number of gaseous particles in finite cells in the 6 dimensional phase space consisting of a 3 dimensional position space and a 3 dimensional momentum space. A given macrostate of the gas, specified in terms of temperature, pressure and volume, will encompass a myriad of microstates. The number of microstates is taken to be Ω. Boltzmann describes the statistical entropy, using modern notation, as S_B = k_B ln Ω. Where the 6 dimensional phase space is divided into m cells with N_j particles in the jth cell, the number of microstates is given by Ω = N!/(N_1!N_2! . . . N_m!). Using Stirling's approximation this becomes the familiar S_B = −k_B N Σ_j p_j ln p_j. Here p_j represents the probability that a single particle is in cell j. For non-interacting systems, and a fine grid of cells, this is equivalent to S_B = k_B ln ΔΓ, where ΔΓ is the volume occupied by the macrostate in the 6N dimensional Γ space.
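The Stirling step can be checked directly. The following Python sketch compares the exact ln Ω = ln(N!/(N_1! . . . N_m!)) with the −N Σ_j p_j ln p_j form for an invented set of occupation numbers; multiplying either value by k_B gives S_B.

    from math import lgamma, log

    def ln_multinomial(counts):
        # Exact ln( N! / (N_1! ... N_m!) ) via the log-gamma function.
        N = sum(counts)
        return lgamma(N + 1) - sum(lgamma(n + 1) for n in counts)

    def stirling_form(counts):
        # The familiar -N * sum_j p_j ln p_j with p_j = N_j / N.
        N = sum(counts)
        return -N * sum((n / N) * log(n / N) for n in counts if n > 0)

    counts = [400, 300, 200, 100]   # occupation numbers N_j of m = 4 cells, N = 1000
    print(ln_multinomial(counts))   # exact ln(Omega), about 1.27e3
    print(stirling_form(counts))    # Stirling approximation, about 1.28e3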

On the other hand, the Gibbs approach envisages an ensemble of replicas of a system, all in different microstates and able to exchange energy with each other. The approach identifies one microstate of a system in terms of the 6N position-momentum coordinates; the system's Γ space. I.e. the microstate X = q_1x, q_1y, q_1z, . . . , q_nx, q_ny, q_nz, p_1x, . . . , p_nz, where q_1x, p_1x are the position and momentum coordinates of particle 1 in the x direction, etc. In which case the Gibbs entropy is

S_G = −k_B ∫ ρ(X) ln[ρ(X)] dX,

where ρ represents the probability density of the states X. In practice, the entropy measure is based on a coarse graining of the Γ space into cells of size Δ to avoid the consequences of Liouville's theorem. In this case, the equation becomes a sum over all the coarse grained cells, S_Δ = −k_B Σ_i p_Δ(i) ln p_Δ(i), where p_Δ(i) is the average probability that the microstate is in cell i. This contrasts with the Boltzmann measure where the probability measure specifies the probability that a particle is in cell i. However, for a large system of non-interacting particles in equilibrium, the Gibbs measure of entropy is the same as the Boltzmann measure.

Missing Information and the algorithmic description

Where most states are algorithmically random, the system is most likely to be in an equilibrium state. In which case, as shown above, the value of the algorithmic entropy is virtually the same as the value of the Shannon entropy. As the equilibrium states dominate the expected algorithmic description, most algorithmic descriptions are of the order of log_2 Ω. Nevertheless, where states show some order, the algorithmic entropy is much lower than the Shannon entropy and it becomes possible to extract work from the system. It follows that if no information is known about the microstate of the system at an instant of time, the best guess, i.e. the most probable algorithmic measure, will be the Shannon value. However, an observation may provide information that the system is in a non-equilibrium or ordered state. In which case, the algorithmic description will be shorter than the description of an equilibrium state. For example, if all the particles of the Boltzmann gas are observed to be on one side of the container, the algorithmic description for this ordered state will be lower than the Shannon entropy. As more information is known about an ordered state, it becomes possible to compress the algorithmic description. This contrasts with a system in an equilibrium state where a more accurate specification of the algorithmic entropy leads to no change. As has been mentioned, this insight prompted Zurek [113] to define the 'Physical Entropy' S_d of a microstate. This is the sum of the most concise but partial algorithmic description based on the available information, and a Shannon entropy term to allow for the remaining uncertainty. When given information d, the best algorithmic description gives the algorithmic entropy as H_algo(s_i|d). This is equivalent to the information that defines the structure or pattern in the string, i.e. the provisional entropy or the AMSS. Zurek sees the remaining uncertainty as captured in the conditional Shannon entropy term H_sh = −Σ_k p_(k|d) log_2 p_(k|d). The physical entropy, given the available information, is S_d = H_algo(s_i|d) + H_sh. Zurek used this approach to argue that, when the physical entropy is low, the system is ordered and work can be extracted. However, S_d can be seen to be the provisional entropy by comparison with sections 4.1 and 4.2. Furthermore, as Equation (4.3) is of the same form as the equation immediately above, there is no need to introduce a new entropy concept. I.e.

S_d = H_prov = H_algo(description of pattern) + H_sh,

as the algorithm that identifies the particular string in the set has the same value as the Shannon entropy. The provisional algorithmic entropy is the best guess of the true algorithmic entropy based on the given information. As more information becomes available, the true algorithmic entropy will drop for ordered states, but remain constant for equilibrium states. Figure 5.1 illustrates these two cases (a) and (b) using Zurek's argument, but with the provisional entropy replacing his physical entropy. As the Shannon entropy changes depending on the available information, it is denoted by H_uncert. in the figure.

Figure 5.1: Provisional entropy. (a) Where the string is a typical member of the set. (b) Where the string comes from an ordered configuration.

a When the microstate is an equilibrium state (figure 5.1(a)), i.e. a typical or random state, the provisional entropy remains unchanged even if more information reduces the uncertainty in the Shannon entropy term (labelled H_uncert. in the diagram), allowing the particular random state to be more clearly identified. The algorithmic term increases to compensate for the reduction in uncertainty. For such a state, the shortest description of the state returns the same value as the Shannon entropy, even when more information becomes available to better identify the exact state. In effect, the Algorithmic Minimum Sufficient Statistic (AMSS) for the state is the set of equilibrium states (see 4.2).

b When the microstate is ordered as in figure 5.1(b), the extra information leads to a reduction in provisional entropy once the extent of the ordering is

ascertained. The approach is general for any noisy description wherepattern, structure or order, is revealed with more precise measurements.This is equivalent to refining a model with further information that nar-rows down the noise or uncertainty. Li and Vitanyi [79] look at thisanother way. They describe the process of revealing order, as one whichmore accurately defines the microstate of a gas. At low resolution thereis only sufficient information to identify the macrostate and the corre-sponding provisional entropy is high. Increasing the resolution of thesystem decreases the algorithmic entropy as the uncertainty is less. Theprocess of doubling the resolution for each of the 6N axes in phase space,corresponds to doubling the number of significant (binary) figures to thespecification of the coordinate of that axis. This process will ultimatelyidentify that the configuration is an ordered one and the provisional algo-rithmic entropy falls with increasing resolution until it finally approachesthe actual algorithmic entropy. This process of more accurately deter-mining the exact configuration of the system is equivalent to successivelyreducing the number of members of the set containing the configurationof interest until the set can be reduced no further.

finding the optimum set containing the configuration. I.e this is equiva-lent to determining the AMSS for the state as discussed in section 4.2.
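
The resolution refinement just described can be sketched numerically. In the toy calculation below (a hedged illustration, not the author's; the particle number, pattern cost and uncertainty figures are assumptions) each unresolved binary digit on each of the 6N phase-space axes contributes one bit of Shannon uncertainty, so the provisional entropy falls toward the fixed cost of describing the ordered configuration as the resolution improves.

    N_AXES = 6 * 4            # 6N phase-space axes for N = 4 particles (assumed)
    PATTERN_BITS = 50         # assumed cost of describing the ordered configuration

    def provisional_entropy(unresolved_digits_per_axis):
        # every unresolved binary digit per axis doubles the number of
        # candidate microstates on that axis, i.e. adds one bit of uncertainty
        return PATTERN_BITS + N_AXES * unresolved_digits_per_axis

    for unresolved in (8, 4, 2, 1, 0):
        print(unresolved, provisional_entropy(unresolved))
    # output: 242, 146, 98, 74, 50 bits; the provisional entropy approaches
    # the 50-bit algorithmic entropy of the ordered state as resolution improves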

Comments on the algorithmic entropy, the Shannon entropy and the thermodynamic entropy

The previous sections have argued that the algorithmic entropy of a typical string, representing a configuration in an equilibrium set of states, is virtually identical to the Shannon entropy. If there are Ω states in the equilibrium configuration only a minuscule number of these strings can be compressed. As a consequence, a typical state can only be represented by a string no shorter than log2Ω. But to reiterate, the Shannon entropy and the algorithmic entropy are conceptually different. The algorithmic entropy returns an entropy value for an actual state and has meaning for a non-equilibrium configuration. On the other hand the traditional entropies return a value for a set of states, which, in the thermodynamic case, is usually the set of equilibrium configurations. Also, the Shannon entropy takes significant background information as given. It does not specify the system in detail, nor indicate which messages, or members, are to be included in the set of possible outcomes. On the other hand, the Algorithmic Information Theory approach must include the following.

• Specify the system including the phase space resolution or the grain size where, for example, the system is a Boltzmann gas [113, 47]. This avoids the ambiguity of the resolution of the system in the Boltzmann approach.

• Specify the computing overheads and the relevant physical laws; i.e. the computational language that expresses statements equivalent to 'PRINT' and 'FOR/NEXT';



• Specify the size of the strings defining elements in the system with self-delimiting instructions.

As is discussed in section 3.4, although the common information must be specified algorithmically, the zero point for the algorithmic entropy can be chosen to include the common information. In which case, the Shannon entropy is seen to be the particular case of the algorithmic entropy for a typical state in an equilibrium configuration.
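
The counting argument behind the earlier claim that only a minuscule number of the Ω equilibrium strings can be compressed is easy to make quantitative. The short Python sketch below (an illustration, not from the text) counts the available short descriptions: there are at most 2^(n−c) − 1 binary programs shorter than n − c bits, so fewer than one string of length n in 2^c can be compressed by c or more bits.

    def fraction_compressible(n_bits, c_bits):
        # at most 2**(n_bits - c_bits) - 1 descriptions are shorter than n - c bits
        return (2**(n_bits - c_bits) - 1) / 2**n_bits

    print(fraction_compressible(100, 10))   # < 1/1024: under 0.1% compress by 10 bits
    print(fraction_compressible(100, 30))   # ~9.3e-10: essentially none compress by 30 bits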

Despite the conceptual difference, the relationship between the algorithmic entropy and the traditional entropies is sufficiently robust that one can slip between one description and the other, provided one allows for rounding of the algorithmic measure and recognises the system specification requirement of the algorithmic approach. The concept of provisional entropy conclusively links the two frameworks. The provisional entropy is the shortest known description of a string representing an instantaneous configuration of a thermodynamic system in its state space, whether an equilibrium state or not. A thermodynamic macrostate consists of a collection of microstates all with the same provisional algorithmic entropy.

The next section explores issues surrounding the provisional entropy and tightens the connection with the traditional entropies.

The provisional entropy as a snapshot of the system at an instant, and the entropy evaluated from net bit flows into the system

While the provisional entropy has the properties of a traditional entropy, a number of questions need to be addressed to ensure the provisional entropy is used in a way that is consistent with thermodynamics. In the next few subsections, the following points are addressed.

• The thermodynamic entropy flows depend on the system's temperature, T, but the algorithmic entropy in bits appears to be temperature independent. How can a room temperature UTM simulate a real world UTM at 100 °C?

• The definition of the algorithmic entropy requires that the algorithm halts once the state of interest is generated. But is this sensible when the halt requirement leads to a significantly longer algorithm? For example, the algorithm to generate the 1,000,000th odd integer is longer than the algorithm that generates all odd integers but never halts. As is argued in the earlier discussion on real world systems in Section 1.4, the critical point is that the halt requirement dominates the halting algorithm needed to specify a state in an evolving system. Importantly it also ensures the algorithmic entropy is consistent with the thermodynamic entropy.

• Nevertheless, real world computations in general do not halt as the configurations are continually changing. The halt requirement in effect defines the algorithmic entropy as a snapshot of the system's configuration at a particular instant and is likely to specify a state in the set of states with the same provisional entropy. Where this is so, the system trickles through the set of states with the same provisional entropy. Even so, this still does not explain how the entropy derived from a snapshot of the system at an instant is related to a system where the computation is ongoing, with bits entering and leaving a system that is maintained distant from equilibrium.

Bullet point 1, the temperature dependence of the thermodynamic entropy

The thermodynamic entropy, ∆S, that flows into a system at temperature T is given by ∆S = ∆Q/T where ∆Q is the heat flow. Where heat flows into the system at a constant rate, the associated entropy flow drops as the temperature rises. Similarly, when bits flow into a system at a constant rate the temperature of the system also rises with the input of energy. As a consequence, the energy to flip one bit must increase with the rise of kBT. So a real world, or laboratory, computer operating at 100 °C requires more energy to introduce one extra unit of information than a computer operating at room temperature. It can be seen that both the entropy and information measures exhibit the same temperature dependence.
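
A quick numerical check of this temperature dependence can be made from Landauer's bound; the sketch below uses the standard Boltzmann constant, and the two temperatures are chosen purely for illustration.

    import math

    K_B = 1.380649e-23                 # Boltzmann constant, J/K

    def energy_per_bit(T_kelvin):
        # minimum energy to set or erase one bit at temperature T: k_B T ln2
        return K_B * T_kelvin * math.log(2)

    print(energy_per_bit(298.15))      # ~2.85e-21 J per bit at room temperature
    print(energy_per_bit(373.15))      # ~3.57e-21 J per bit at 100 °C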

Bullet point 2, why must the algorithm halt?

The second bullet point above draws attention to the requirement that the algorithm that determines the algorithmic entropy must halt if the entropy measure is to align with the thermodynamic entropy. In a real world computation, the natural laws that implement the second law of thermodynamics correspond to a programme whose bits are embodied in the stored energy driving the system towards equilibrium.

In this situation, some bits can be identified as programme bits. These carry the energy to implement the natural laws changing one state to another, whereas other bits represent the states or configurations of the system at the start and the finish of the computational process. I.e. one can specify the configurational states of a system of hydrogen and oxygen by a string specifying electronic states, position states and momentum states. But the bits expressing the rules of interaction between a hydrogen and oxygen molecule are not part of the configuration specification but instead are programme bits. These bits embody the stored energy that implements the natural laws shifting hydrogen-oxygen states to H2O states. The halt requirement in effect determines the number of programme bits that are converted to state bits at the end of the computation, as is discussed in more detail later in Section 6.3. Ultimately, when the processes behind the second law of thermodynamics use all the stored energy, the system exists in an equilibrium set of configurations as nearly all programme bits, except those required for reversibility, have become state bits.



In the hydrogen-oxygen system the equilibrium set of microstates is dominated by H2O existing at a higher temperature than the initial configuration. Because of reversibility, collisions can dissociate the water molecules, storing energy in the separate hydrogen and oxygen molecules. But as this is counter to the second law of thermodynamics, the probability of significant quantities of water dissociating by collisions is low.

Bullet point 3, non-halting computations

The third of the above bullet points raises the question as to how the algorithmic entropy of an instantaneous configuration, measured as a snapshot at that particular instant, is related to the bits residing in the system determined by tracking net bit flows. The thermodynamic equivalent of this situation corresponds to answering the question: "What is the thermodynamic entropy of a system where heat has been flowing in and flowing out, but where the heat flow has been stopped at the particular instant?" In the thermodynamic case, given the thermodynamic entropy at the start of the process, the thermodynamic entropy at a later time consists of the start entropy plus the net entropy change of the system at that time, when the system can be taken to be isolated. In which case, if ∆Qinout represents the net heat flow in and out at temperature T, the change in the thermodynamic entropy, ∆S, corresponds to ∆Qinout/T. However, knowing the heat energy initially in the system, and tracking heat flows in and out, one can calculate the heat in the system at the time when the system has reached a particular state. For heat inflow, disorder increases at the boundary between the system and the environment, and then diffuses through the system. Similarly for outflows, ∆Qout/T orders the system initially at its boundary.

The net entropy change at the time of interest is equivalent to isolating the system, i.e. halting the entropy flows at that time. This argument based on heat flow works because the equation specifying changes in the thermodynamic entropy is an exact differential, ensuring the thermodynamic entropy is a function of state and the path to reach that state is irrelevant. One can articulate a similar argument for the algorithmic entropy using bit flows rather than heat flows. But as it is not obvious that the algorithmic entropy is path independent, an argument may well be needed to show that the entropy evaluated by tracking bit flows is also path independent and therefore aligns with the thermodynamic entropy. As was discussed in section 3.2, extra bits are needed to connect subroutines, which suggests a complicated path may need extra bits. The following shows that, provided the path is reversible and the computation on-going, so that the process does not halt in an intermediate state, a path connecting subroutines by tracking bit flows gives the same result as the shortest direct description of the state. It is shown later that this is of particular significance for a system that is maintained in a homeostatic or stable state by the inflow and outflow of bits to counter the degradation processes of the second law of thermodynamics.

It is to be shown that: given the initial algorithmic entropy, the entropy established by tracking the bit flows in and out of a real system up to a given instant is the same entropy value as that given by the algorithm that generates a particular state and then halts at that instant.

Consider a system where the initial algorithmic entropy of state s0 at time zero is specified by H(s0) configurational bits, and a programme p∗ accesses a resource string σ to generate a subsequent state ηk from s0. The programme bits in p∗, which implement the natural laws defining interactions between species, as mentioned earlier, are distinct from the bits that specify the configurations σ and s0. Some of these programme bits in p∗ may well represent the computation that degrades the initial state under second law processes, such as happens when hydrogen and oxygen ignite to form water. The residual programme bits needed to capture reversibility are denoted by g [114], i.e. the instructions to drive ηk back to p∗, s0, σ. In which case, the computation that generates a subsequent state from a start state s0 is given by,

U(p∗, s0, σ) = ηk, g.

If no information is lost, the algorithmic entropy of the configuration of the final microstate at this instant is,

H(ηk, g) = H(σ) + |p∗|+H(s0).

Here g is the number of bits needed to ensure reversibility, i.e. to turn ηk back into s0.

Heat exits a real world system because the external environment in effect undertakes a computation that transfers some of the bits specifying the momentum states in the system to bits specifying the momentum states in the environment. This can be visualised as a cold wall surrounding the system absorbing energy by collisions with the wall, or by bits increasing the momentum states in the environment as radiation escapes through the transparent wall. Similarly, when species exit the system, the computation that effects this transfers both programme and state bits outside the system. This computation, shifting bits to the environment, effectively separates the configuration ηk, g into two clearly defined strings that here will be denoted by ǫi, g′ and w. The first refers to the bits remaining within the system while the second exits as the waste because of the separation. As the bits remaining in the system need to align with the entropy defined by a reversible halting algorithm, the reversible bits g′ in this algorithm must be included in the description of the final state. Although, being reversible, H(ǫi) = H(ǫi, g′) for the overwhelming majority of cases, the g′ needs to be identified to be sure everything is accounted for. This computation, undertaken by programme q∗, can be represented as

U(q∗, ηk, g) = ǫi, g′, w.

As the bits remaining in the system and those ejected are both included,

|q∗| + H(ηk, g) = |q∗| + H(σ) + |p∗| + H(s0) = H(ǫi, g′) + H(w).



Hence H(ǫi, g′), the measure obtained by a halting algorithm, is given by

H(ǫi, g′) = |q∗| + H(σ) + |p∗| + H(s0) − H(w).

This confirms that, when programme and residual programme bits are treated consistently, the algorithmic entropy remaining in the system equals the initial number of bits plus the net bit flows into or out of the system.
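
The bit accounting above can be laid out as a simple ledger. The sketch below is only illustrative (the bit counts are assumptions, not derived from any physical system); it shows that the bits remaining in the system after the waste is ejected are the bits that entered, |q∗| + H(σ) + |p∗| + H(s0), less the waste H(w).

    def bits_remaining(H_sigma, p_star, H_s0, q_star, H_waste):
        # bits in = separating routine q* + resource string + programme + initial state
        bits_in = q_star + H_sigma + p_star + H_s0
        # bits out = waste ejected to the environment
        return bits_in - H_waste

    # e.g. 500 resource bits, 200 programme bits, 100 initial-state bits,
    # a 20-bit separating routine, and 450 bits ejected as waste heat
    print(bits_remaining(H_sigma=500, p_star=200, H_s0=100, q_star=20, H_waste=450))
    # -> 370 bits: the halting (snapshot) description of what remains in the system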

The following conclusions summarise the above result.

1. If one can properly track entropy flows given the initial algorithmic entropy of the system, the algorithmic entropy of the final state at a particular instant can be determined directly from the net bit flows, rather than by finding the number of bits in the algorithm that generates the final state.

2. Tracking bit flows between routines enables the net change to the bits in the system to be identified. Hence, if the algorithmic entropies are defined in terms of bit flows, in contrast to the situation where halting algorithms are specified, bits that pass from one routine to another do not need the linking instruction of section 3.2. If an on-going computation places a subsystem in a configuration that enables bits to be accessed as an input to another subsystem, the corresponding computation halts when bits exit the subsystem and therefore no instruction to link the output to a second subsystem is needed. I.e. in an on-going real world computation, outputs of subroutines are placed in a position where they are accessible to other routines.

3. The greater H(w) is for a given input, the shorter the snapshot or halting algorithmic description and, as the system is further from equilibrium, the more ordered it is. Interestingly, Schneider and Kay [92] have measured the energy dissipation for an ecology and have shown that the more biologically complex the ecology, the more effectively it processes inputs and, as a consequence, transforms more of the stored energy into heat. The waste is more degraded, in the sense that the bits originally specifying the programme, embodied in states that store energy, are more comprehensively transformed into bits specifying the momentum states, which can be ejected mainly as heat.

4. Furthermore, given the start entropy, the net bit flows from Landauer's principle correspond to the net thermodynamic entropy change to the system. A simple thought experiment reinforces the connection between the algorithmic entropy and the thermodynamic entropy. Effectively the difference in algorithmic entropy between two states is about the transfer of bits of information, whereas the thermodynamic entropy is about the transfer of energy. When ∆H bits are transferred between two states in a computational system at a temperature T the energy transfer is kBT ln2∆H from Landauer's principle [74]. Consider states i and o that belong to different macrostates. While the system is isolated in either of these states, each is a typical configuration in its macrostate. The minimal possible number of bits that must be transferred into or out of the system to change state i to state o is |o∗| − |i∗| [13]. As this is Hprov(o) − Hprov(i), the energy inflow required to set the bits in the new configuration in a real world computational system at temperature T is kBT ln2[Hprov(o) − Hprov(i)] (see section 6). From a thermodynamic view, the heat flow ∆Q needed to shift the real world system at temperature T from the set of microstates containing i to the new equilibrium macrostate containing the state o will be identical to the energy flow needed to change the bit settings in shifting from i to o. As a consequence, the thermodynamic entropy difference between the states is also ∆Q/T = kBln2[Hprov(o) − Hprov(i)] (cf [114]). An example that illustrates the point is the difference between a block of ice at 0 °C and melted ice at the same temperature constrained to the same shape; a numerical sketch follows this list. Given the original ice configuration, the number of bits that need to enter the system to specify the melted configuration is ∆Q/(kBT ln2). Thus the thermodynamic entropy difference between the two equilibrium states is identical to the algorithmic entropy difference in bits multiplied by kBln2. Not only are the two entropy differences identical for equilibrium states (allowing for units), but the provisional entropy also has meaning for off-equilibrium or non-typical configurations.
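
The ice example can be checked numerically, as the sketch below shows (standard constants; 334 J is the conventional latent heat of fusion for one gram of ice, used here only for illustration).

    import math

    K_B = 1.380649e-23          # Boltzmann constant, J/K
    T = 273.15                  # K, the melting point
    delta_Q = 334.0             # J, approximate latent heat to melt 1 g of ice

    bits = delta_Q / (K_B * T * math.log(2))
    print(f"{bits:.2e} bits")                         # ~1.3e23 bits must enter
    print(f"{bits * K_B * math.log(2):.3f} J/K")      # = delta_Q / T, ~1.223 J/K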


Chapter 6

Reversibility and computations

A historical difficulty with the statistical approach to thermodynamics is illustrated by the thought experiment involving an informed agent, commonly known as Maxwell's demon, who is able to extract work at no cost by judiciously manipulating the particles of a gas in a container. One version of the thought experiment makes the argument that a container of gas can be divided into two halves, both at the same temperature. Maxwell's demon opens a trap door in the dividing wall to allow faster moving particles to collect on one side of the container and closes the door to stop the faster particles returning, thus keeping the slower moving particles on the other side. Over time, the side with the faster moving particles will be at a higher temperature than the side with the slower moving particles. It appears that work can then be extracted, violating the second law of thermodynamics.

Initially it was understood (see Szilard [103, 102] and Brillouin [14]) that, as the demon needed to measure the position-momentum coordinates of a particle to know when to open the trap door, it was necessary for the demon to interact with the system under study. It was then argued that the measurement process would require an entropy increase of kBln2 for each bit of information obtained.

Landauer [74] showed that this explanation was not completely satisfactory, as a thermodynamically reversible process cannot lead to an overall increase in entropy. Landauer pointed out that it is the irreversibility of the process that is critical. This irreversibility is reflected in the logical irreversibility of the computation undertaken by the physical system. This happens when information about the prior state is not available, either because information has been discarded, or because there is more than one prior state. As a consequence, when the equivalent computational process becomes logically irreversible the system is no longer thermodynamically reversible. Landauer's approach resolves the paradox of Maxwell's demon, the intelligent interventionist who otherwise would be able to extract work for no cost. Bennett [10, 11, 9, 12] and Zurek [114] show that the demon cannot violate the second law of thermodynamics as the demon must erase the information that was gained in making a measurement, in order to make room for the next measurement. The demon will need a continuous supply of energy to flip new bits, resulting in an increase in its temperature and requiring heat to be passed elsewhere if the process is to continue. The demon is in effect part of the total system and constrained by physical laws. Bennett (see also [114]) illustrates the argument with a gas of one particle and uses the demon to trap the particle on one side of a partition in order to extract work. The entropy loss occurs when the demon's memory is reset to allow the process to cycle. However, there are subtleties, as Bennett [12] points out. Where information initially is random, the entropy of the system drops if the random information is erased, but the entropy of the environment still increases. One can envisage this erasure as setting all bits to zero, effectively ordering the system and transferring energy outside the system. The key papers expressing the different interpretations of the Maxwell demon problem, and a discussion on the "new resolution" involving erasure and logical irreversibility, are to be found in Leff and Rex [76].

Implications of an irreversible computation that simulates a reversible real world computation

This principle of Landauer has important implications not only for reversibility, but also for AIT, the second law of thermodynamics and the natural processes that effectively regulate the system, maintaining it in a homeostatic configuration. However there is an important difference between a real world computation and that on a typical reference computer. The generalised gates in the real world computer (i.e. the atoms, molecules and computational elements that change their states under the computation while passing information through the system) are reversible. Reversibility is hardwired into the computational process through the physical laws that determine the computational behaviour of the components such as atoms and molecules. The input string to the computation is stored in the initial states of the atoms or molecules at the start of the process. Given the initial state, the physical laws, which determine the interaction between the states of the species in the system, drive the trajectory of the system through its state space. Similarly, as Landauer [75] points out, any conventional UTM is itself constrained by physical laws and is also a real world device. In this case, the input and the output are not the digits written on an input/output device such as a tape, but the initial configuration and final configuration of bits stored within the computer at the start and finish of the computation. The computation that specifies how the system steps through its bit space is determined by the initial bit settings, which include the programme that manipulates the physical laws embedded in the gates [74, 75, 104]. However, the AND and OR gates of the typical reference UTM are irreversible. The simple ballistic computer of Fredkin and Toffoli [55] illustrates how a physical process can act as a reversible computer. However a more relevant example that illustrates the relationship between a physical or chemical process and a computational process is the Brownian computer of Bennett [10, 9]. This is the computation embodied in the process by which RNA polymerase copies a complement of a DNA string. Reversibility is only attained at zero speed as the computation randomly walks through possible computational paths. Indeed, because the process is driven forward by irreversible error correction routines that underpin natural DNA copying, the process is no longer strictly reversible.

While it is possible to construct reversible gates [55], in practice when an irreversible reference UTM is required to simulate a real world UTM, the history of the computation, which contains the information needed for reversibility, must be kept and tracked. As the next section outlines, Bennett [9] lists the steps required to generate the initial state from this computational history. That this is necessary can be seen with the example of oxygen and hydrogen igniting to form water, in so doing increasing the system's temperature. If nothing escapes, this is a reversible process. Given the initial state of the system, the programme on the reference UTM that defines the algorithmic entropy of the final state of the system must also specify the final momentum states, not just the atomic and molecular species. These momentum states contain the history needed for reversibility. If heat is irreversibly lost, lowering the temperature and leaving water behind, the conventional computer must lose bits associated with these momentum states in order to simulate the final configuration of the real world system. This is such an important point that it will be revisited throughout the book. In a nutshell, the state space configurations of the real world computer are encapsulated in what Landauer [75] and Bennett [12] call the Information Bearing Degrees of Freedom (IBDF). The bit space configurations of the reference UTM capture the same information as the real world computer that it simulates, provided the history information in the reference UTM is tracked properly.

The Landauer principle, outlined above, clarifies the essentials of a computing process undertaken by a physical system. The input string to the computation is stored in the states of the atoms and molecules at the start of the physical process. In an isolated computer system, bits are conserved just as energy is conserved. Because the Hamiltonian dynamics of the system conserves information, whenever a process becomes irreversible, information is not conserved; i.e. when information is lost, kBln2 entropy units per degree of freedom must be transferred from the system with at least a corresponding entropy increase in the environment. As Landauer showed, it is not the cost of measurement of the information that is critical, but the energy cost that arises when information leaves the system by being transferred to the external environment. It is this that causes irreversibility. Bennett [10, 11, 9, 12] and Zurek [114] take the argument further. Discarding 1 bit of information from the IBDF at a temperature T contributes about kBT ln2 joules to the environment. Both logical and thermodynamic irreversibility follow as information is removed from the computation. This leads to a reduction of entropy of the system matched by an increase of entropy in the universe. Irreversibility can occur when heat is transferred out of the system or when material escapes carrying the information in its "information-bearing" degrees of freedom. A transfer of information into the system also destroys reversibility unless the transfer process retains reversibility. Chapter 8 of Li and Vitanyi [79] reviews relevant issues related to reversibility and the thermodynamics of real world computing. Figure 6.1 is a schematic diagram of different regions of the bit space in a conventional computer. These regions are more or less distinct in a conventional computer, while in real world computations the programme region, which expresses physical laws, is not obvious as the information is embodied in the interaction between the species in the system. In a conventional computer, the bit settings in the programme region under the programmer's control determine the trajectory of the system in bit space as the input moves to the output. However, once an input and the programme are encoded as "on" and "off" bits in a computer, the distinction between the programme and the input becomes arbitrary. A more rigorous argument that demonstrates this point for a simple UTM is shown in Chaitin [25].

For similar physical situations (as was outlined in section 1.2, bullet point 2), the contribution of the common routines to the algorithmic entropy can be ignored. As only entropy differences are significant, the common routines, such as the relevant physical laws and the specification of the system, cancel. This is extremely fortunate, as the specification of all the natural laws would be an impossible task. In effect one can set an entropy zero that takes most laws as given and only considers those laws that are specifically needed to describe the changes of the states.

6.1 Implications for algorithmic entropy and the second law

Physical processes are in principle reversible and can map on to a logically reversible TM that operates on the input string (representing the initial state) to produce an output string (representing the final state). The computation takes place through the operation of the physical laws determining the trajectory of the system through its states. However in practice most real world systems are open systems and information and material may enter or leave the system. As is mentioned in the previous sections, when information is removed, either because energy is removed, or because species leave carrying information, the computational possibilities of the system are altered.

Nevertheless, reversible and irreversible processes cannot be compared on the same reference computer unless, for the irreversible case, the information required to maintain reversibility is tracked. This follows from the Landauer principle, where any computational process, whether on a conventional UTM or a real world computer, cannot be reversible if information about prior states in the computational process is discarded.

However, reversibility can be retained in an irreversible computer if the information needed for reversibility is stored in the computer memory. The information kept is the computational history. Reversibility then becomes possible if this information can be inserted at critical steps in the reverse computing process. The real world equivalent is where the information about to be passed to the environment is kept, maintaining reversibility. An example is the heat produced when ice crystallises from water. If this heat is kept the process is reversible, but if it is passed to the environment reversibility is lost.

Figure 6.1: Schematic of a computational system

The usual non-reversible UTM describes a non-reversible physical process when the history of the computation is lost. Nevertheless, a reversible process can be mapped on to a non-reversible UTM by using an algorithm that works in both the forward and the reverse direction, inserting the information that otherwise would be lost at the irreversible computational steps [80, 9, 113]. As Bennett [9] shows in detail, such a reversible program broadly speaking has the capability to:

• Generate the output from the input

• Copy the output

• Reverse the computation to remove the accumulated history and regenerate the input.

• Swap the regenerated input and the copied output to allow the reverse computation.

The question arises as to how the algorithmic entropy H(o|i), which is usually derived from the shortest irreversible computation that describes the output o given i, is related to the reversible computation. Let KR(o|i), or its equivalent KR(i|o), be the length of the shortest reversible computation that maps the same physical process. Because it is likely that extra information is needed to make the process reversible,

H(o|i) ≤ KR(o|i) or KR(i|o).

The algorithmic entropy is given by the minimal reversible programme p∗ that generates the output from the input on the reference UTM. Its size can be no longer than the reversible program that maps the reversible physical process.

However the program implicit in the Brownian computer described by Bennett [10, 9] may be able to be shortened if a path involving, say, a catalyst is replaced by a more direct computing path. The speed of the process without the catalyst might be less, but the outcome would be the same. Several authors [15, 80, 110] have considered the trade-off between computer storage and the number of computing steps to reproduce a given output. The shorter the computation, the more information storage is required to achieve the reversible computation. For on-going computations, the information stored as the history must be erased. This can only happen if entropy is lost to the environment, making the process more difficult to reverse.

The difference in algorithmic entropy between an input state i and an output state o measures the number of bits that must enter or leave the system when i shifts to o. This entropy difference is independent of the UTM used. If the system is to be restored from o to i, the flow of bits must be reversed. The thermodynamic cost of maintaining order or restoring a system, as bits are added or discarded, is kBln2 per bit from Landauer's principle [74]. This is discussed in more detail in the next section. The consistency of the argument can be seen by considering the following examples.

• An adiabatic expansion of an ideal gas against a frictionless piston increases the spatial disorder. The movement of the piston in effect is like a programme, changing the computational trajectory of the components of the gas. The algorithmic entropy is derived from the shortest description of the instantaneous configuration of the gas in terms of the position and momentum coordinates of each gas particle. In algorithmic terms, while the contribution to the algorithmic entropy of the position coordinates increases through disordering, the algorithmic entropy of the momentum coordinates decreases to compensate. This decrease corresponds to a drop in the temperature. Adiabatic compression is the converse process. Adiabatic demagnetisation is analogous to adiabatic expansion. Adiabatic demagnetisation randomises a set of aligned spins as an external magnetic field is reduced. The disordering soaks up the entropy embodied in the thermal degrees of freedom, lowering the system's temperature. While a system remains isolated no erasure of information, or total change in entropy, occurs.

• On the other hand, an isothermal expansion of a gas against a piston can disorder or randomise a system as heat enters to do work against the piston. In contrast to the adiabatic case, the algorithmic description of the system increases as heat flows in. The thermodynamic entropy increases by kBln2 per bit change in the algorithmic description (a numerical sketch follows this list). Conversely, isothermal compression orders the gas. For an isothermal magnetic system, the equivalent of isothermal compression is where an external magnetic field is applied to align spins, and in doing so passes kBT ln2 joules per spin to the environment.

• In an open, real world system, the computational processes that reset the states of the computational elements (i.e. the atoms, molecules, spins etc.) require energy to be expended or released. This may occur through electrical energy; through magnetic field gradients that align spins; through photons and molecular structures that enter or leave the system; through heat flows into or out of the system; and finally through work being done on or by the system. Energy sources within the system can also be redistributed to reset the computational states as, for example, when hydrogen and oxygen are converted to water by these real world computational processes. In the hydrogen-oxygen case, the energy released is passed to the kinetic energy states of the system. Provided the system is isolated there is no entropy change. The process increases the length of the algorithmic description of the momentum states, while reducing the length of the algorithmic description of the position coordinates. As one water molecule replaces two hydrogen atoms and an oxygen atom, the description of the species also changes. The physical (or computational) processes within the system disperse the entropy. When the system is no longer isolated, regions that are disordered relative to the environment, such as those embodied in the kinetic energy degrees of freedom, can then pass excess entropy to the environment, leaving the more ordered regions behind.

Ordering only occurs when entropy is ejected overall, for example when heat is passed to a low entropy sink in the environment or high entropy molecular fragments escape the system. The programme on the reference UTM that specifies the states must capture the same information as the real world UTM. When material enters or leaves a real world system, expanding or contracting the state space, the corresponding process in the reference computer involves expanding or contracting the bit space. This could correspond to bits passing to the computer memory area corresponding to the environment. Alternatively it could occur through the connection or removal of a memory device such as a memory stick. For example, when species enter a real world system, altering the trajectory through its state space, the equivalent for the reference computer can be envisaged as a programme change brought about by adding programme bits via a memory stick.
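
Returning to the isothermal expansion bullet above, the correspondence between the extra bits in the description and the thermodynamic entropy can be sketched numerically (one mole of an ideal gas and a volume doubling are assumed purely for illustration).

    import math

    K_B = 1.380649e-23            # Boltzmann constant, J/K
    N = 6.02214076e23             # particles in one mole (assumed ideal gas)

    # doubling the volume isothermally adds one binary digit to each
    # particle's positional description, i.e. N extra bits in total
    extra_bits = N * math.log2(2.0)
    delta_S = K_B * math.log(2) * extra_bits
    print(f"{extra_bits:.3e} bits, delta S = {delta_S:.2f} J/K")   # ~5.76 J/K = R ln2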

6.2 The cost of restoring a system to a previous state

As has been mentioned, the degradation processes of the second law of thermodynamics can be visualised as a computing process. The input string i specifies the information in the system, the information in the nutrients and any source of concentrated energy. A programme string p specifies the algorithm embodying the physical laws, which have not been made explicit, but which drive the system to a more disordered configuration. The output string o is generated by the computation on computer U where U(p∗, i) = o. If all the information is kept in the computation, i.e. nothing escapes the system, the process is thermodynamically and computationally reversible.

However, the process becomes irreversible when the system loses information through heat transfer, or through material exiting the system. This information must be replaced if the system is to be restored to its original macrostate. This means that a living system must maintain itself in a far-from-equilibrium configuration by offsetting the degradation implied by the second law of thermodynamics. Section 1.2 introduced Chaitin's degree of organisation Dorg, which is virtually the same as the degree of order Dord, the measure of how far a system is from equilibrium. If there are Ωset states in the equilibrium configuration, the algorithmic entropy of a typical string will be close to log2Ωset. Denoting the algorithmic entropy of the ordered string s by H(s), Dord = log2Ωset − H(s). This distance from equilibrium can only be maintained by the injection of appropriate order and the ejection of disorder.
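
A minimal sketch of this measure follows (the numbers are illustrative assumptions, not taken from the text).

    import math

    def degree_of_order(omega_set, H_s):
        # D_ord = log2(Omega_set) - H(s): bits of order relative to equilibrium
        return math.log2(omega_set) - H_s

    # equilibrium set of 2**100 microstates; the observed ordered state
    # is describable in 60 bits, so 40 bits of order must be maintained
    print(degree_of_order(2**100, 60))    # 40.0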

Zurek [114] has applied the argument of section 6 to quantify the cost of restoring a system to its input state i when information has been discarded to produce the output state o. He has shown that at least

|i∗| − |o∗| = H(i)−H(o) = H(i|o∗)

must be returned to the system [114], noting here that H(i) represents the information content of the initial state of the system and H(o) that of the final state. If the system is originally ordered and the final state more disordered, H(i) < H(o) and entropy must flow in to generate the final state o. In which case, the initially ordered system can do work, as happens when a heat input forces a gas to do work against a piston. Zurek [114, 113] argues that if one knew the exact description of these states, a cyclic process based on this information would be maximally efficient for extracting work. While Zurek [113, 114] has also shown that a slightly stronger form of the restoration process involves returning H(i|o) bits to the system, the weaker version, generating i from o∗, provides a lower bound to the cost of restoration. The weaker version is more useful, as it is antisymmetric and can therefore track the entropy gains and losses around a thermodynamic cycle, allowing for the directions of entropy flow (Appendix A, [13]). Furthermore, the restoration path does not need to be the exact reverse computation of the forward path.

Consider a system that was originally in a microstate ǫi belonging to a stable set of states ǫi, ǫj etc., all of which have the same provisional entropy, e.g. Hprov(ǫi). The disturbance δ switches the trajectory of the system to a new configuration ηj outside the stable set. This is analogous to plugging a memory stick containing a computer virus into a computer. The corresponding computation p∗δ effected by the disturbance or virus is U(p∗δ, ǫ∗i) = ηj. The number of bits that enter the system due to the disturbance is

|p∗δ| = H(ηj) − Hprov(ǫi).

Here a positive sign represents an entropy inflow. The measure is an entropy difference and is the same whether evaluated on a real world computer or a reference UTM. Applying Zurek's argument over a cycle, at least Hprov(ǫi|η∗j) = Hprov(ǫi) − H(ηj) bits must be returned to the system to restore it to one of the states in the original stable set. This leads to the thermodynamic requirement from Equation A.1 that at least

kBln2 Hprov(ǫi|η∗j) = kBln2[Hprov(ǫi) − H(ηj)] (6.1)

joules per degree must be introduced into the system to restore ǫi. The real cost of course may be higher as systems are seldom thermodynamically efficient.
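
Equation (6.1) can be turned into a small numerical sketch (the bit counts are invented for illustration; a negative result means bits, and hence entropy, must be ejected rather than introduced to restore the original ordered state).

    import math

    K_B = 1.380649e-23            # Boltzmann constant, J/K

    def restoration_entropy(H_prov_initial, H_disturbed):
        # k_B ln2 [H_prov(eps_i) - H(eta_j)], in J/K
        return K_B * math.log(2) * (H_prov_initial - H_disturbed)

    # a disturbance that adds 1e21 bits of disorder must be countered by
    # ejecting at least the corresponding entropy to restore the stable set
    print(restoration_entropy(H_prov_initial=2.0e21, H_disturbed=3.0e21))  # ~ -9.6e-3 J/K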

Ashby [3] has argued that an open system must have the regulatory capability to adapt to a changing environment to maintain homeostasis. From a cybernetic point of view, an external regulator is required to maintain the system's variables within a viable region. This approach leads to Ashby's law of requisite variety, which states that the requirement for a system to regulate is that the Shannon entropy of the set of regulation states must at least equal the Shannon entropy of the set of disturbances [3, 20]. From a computational point of view, Zurek's argument above shows that where an external regulator passes information in and out of the system, a law similar to Ashby's law of requisite variety emerges. Here the law becomes: "The algorithmic entropy introduced by the correction process must equal the algorithmic entropy introduced by the disturbance".

This point of view considers the disturbance and the regulator to be external to the system, with both acting on it. However when a system is distant from equilibrium, the second law processes that drive the system towards equilibrium act as the disturbance, but in this case they are part of the system. Similarly, where a system self-regulates, the regulation process acts within the system. For example a forest can maintain itself by utilising photons from the sun. If the light is blocked the forest will degrade to a local equilibrium. Section 9.4 generalises the argument discussed above for the situation of self-regulating systems operating under second law degradation processes.

6.3 The algorithmic equivalent of the second law of thermodynamics

The second law of thermodynamics requires that the entropy of an isolated system increases to a maximum as the system's trajectory moves from a non-equilibrium configuration towards an equilibrium one. The Boltzmann approach explains the entropy gain by the increase in the system's available states as the energy becomes uniformly dispersed. Ultimately at equilibrium, when the number of available states is Ω, the Boltzmann entropy becomes kBlnΩ. On the other hand, the algorithmic approach identifies the entropy gain as the increase in the length of the algorithmic description of the microstate of the system as its state trajectory moves from an ordered configuration to a typical or equilibrium one. This increase is, in effect, due to the number of programme bits embodied in stored energy at the start of the computation that become the bits needed to define the more disordered configuration at equilibrium. As is discussed later, at equilibrium the history of the trajectory is lost, and the overwhelming majority of states require an algorithmic description of length log2Ω + O(log2log2Ω), where the second term allows for self-delimiting coding. The corresponding algorithmic entropy is close to log2Ω, a value that aligns with the Boltzmann entropy of kBlnΩ for the equilibrium configuration.

The argument, as is discussed later, is not as straightforward when a reversible and isolated system is being simulated on an irreversible reference UTM, as reversibility requires that no information can be lost, and each state must have a unique precursor. In essence, the system is at least quasi-ergodic, and the system's trajectory will access all Ω states in Ω steps, allowing the evolution of such a system to be tracked. The algorithmic details of the second law processes that drive an isolated, real-world classical system from an initial ordered state s0 through intermediate states to settle in an equilibrium state have been developed in Appendix C of the article by Zurek [113].



Assume a finite but large discrete set of accessible states, or alternatively define a discrete set of states by choosing an appropriate degree of resolution in the phase space. Consider an isolated reversible natural system where the initial state s0 is the most highly ordered state of the system, and its algorithmic entropy is H(s0). Given the initial state s0 let th be the number of computational steps needed to halt at the particular output state sh. The real world computer that shifts s0 to sh has the physical laws defining the trajectory embodied in atoms and molecules that behave as generalised gates, the natural equivalent of computational gates. The trajectory of the system in the natural world can be tracked by a reversible algorithm that simulates the physical laws on the reference UTM. As each step in the evolution of the system is reversible, the reversible algorithm that maps the physical laws on the reference UTM to produce the final or halt state sh can be expressed in the following schematic form (e.g. see Zurek [113]):

STATE = s0
FOR STEP = 0 to th
    Compute next STATE
NEXT STEP                                                  (6.2)

For example the initial ordered state could be an isolated system of hydrogen and oxygen. Under the right conditions, ignition takes place, driving the system towards the equilibrium set of states. This set of states for most of the time will be at a higher temperature than the initial configuration and will be dominated by water as steam, with traces of hydrogen and oxygen. However where the computation halts before reaching equilibrium, the algorithmic entropy of the halt state sh is the length of the shortest algorithm that generates the halt state. But, as this is a natural process, reversibility is implicit. Where the reversible trajectory is mapped by an irreversible reference UTM the history of the computational process must be kept to ensure reversibility. Provided the reversible history is kept, the total entropy before and after a second law process is constant. As is discussed later, from an observer perspective, the apparent increase in entropy arises as programme bits, embodied in the stored energy of the molecular species, enter the configurational states of the system, causing the computer to halt after th steps.

The algorithmic entropy of the reversible system is expected to consist of the entropy of the initial state, the length H(th) of the routine that specifies th (the halt requirement), and the entropy in the routines that compute the next state. If the binary algorithm that generates the trajectory is the shortest possible, then the algorithmic entropy of the halt state after th steps is given by:

H(sh) = H(s0) +H(th) +H(Compute next STATE) +O(1). (6.3)

As the system is reversible there can only be one forward algorithm to fully describe the halt state including the history that is needed to maintain reversibility. Therefore Equation 6.3 must have an equals sign, as the entropy of the halt configuration, with the computation's history, is equal to that of the initial configuration including the programme. Similarly, as discussed in section 6, when the algorithm is run on an irreversible UTM and the history is kept, no entropy is lost to the environment and the process is logically and thermodynamically reversible [10, 9, 114]. The bit settings of the initial programme and initial state can then be reconstituted from the halt state given its history [9, 11].

Furthermore, as science is predicated on the assumption that the natural laws are simple and few, the subroutines expressing the physical laws (i.e. Compute next STATE) are expected to be highly compressed relative to the size of the overall programme as th grows. In most situations, the number of steps is extremely large and the dominant contribution to the length of the algorithm implementing Equation 6.2 to give H(sh) of Equation 6.3 is the length of the specification of th, i.e. H(th). This is because the entropy of the initial state, H(s0), is also negligible in relative terms. As the reversible system steps through the allowable states following equation 6.3, the value of H(th) on average hugs the log2th curve (ignoring the higher logarithm terms required for self-delimiting coding). While the length of the dominant contribution to Equation 6.3 grows on average as log2th, there are many fluctuations where H(th) drops below this, as figure 3.2 in Li and Vitanyi [79] indicates. This is because shorter descriptions of the integer th exist for particular values.
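
The scaling just described can be sketched as follows (the initial-state and physical-law bit counts are assumptions chosen only to show the trend; as th grows, the step-count term log2 th swamps the fixed terms).

    H_S0 = 50        # assumed bits for the ordered initial state
    H_LAWS = 200     # assumed bits for the "Compute next STATE" routines

    def H_halt_state(log2_t_h):
        # schematic form of Equation (6.3), ignoring self-delimiting overheads
        return H_S0 + log2_t_h + H_LAWS

    for log2_t_h in (1e2, 1e4, 1e6):
        print(log2_t_h, H_halt_state(log2_t_h))
    # as the number of steps grows, H(t_h) ~ log2(t_h) dominates the
    # fixed contributions from the initial state and the physical laws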

While the above seems straightforward, the following issues are clarified in the next sections.

1. It is important to note that the programme that defines the algorithmic entropy must halt by definition, as the programme must specify the configuration exactly. In an isolated system there can be no entropy increase as the system moves reversibly from an apparently ordered configuration to the halt state. However, because real world computations do not halt, the entropy defined by the halting algorithm can be established by tracking bit flows, as is discussed in Section 5.2.

2. Yet there is a paradox. Equation 6.2 implies the entropy is increasing as the reversible system moves from s0 to sh. But there cannot be an entropy increase, as the entropy of a natural process must be the same along a reversible path, where there is no entropy exchange with the environment [10, 9, 114]. The paradox is resolved in the discussion below where it is shown that the programme bits introduced into the system at the start of the computation are stored as the energy driving the laws of interaction between species in the natural world system. While these bits are distinct from the bits defining the configurations of s0 and sh (see 6.3 below), they strictly should be identified as part of the initial setting of the system. However the observer primarily identifies order in terms of the observed configuration in the state space, and unless all bits are being tracked, programme bits are ignored.



3. Under what circumstances is the programme of equation 6.2 the shortest possible programme to generate the halt state sh? I.e. when does its length represent the algorithmic entropy of sh?

4. How does the algorithmic entropy that increases through the trajectory relate to the equilibrium entropy?

5. Is the algorithm based on the genetic code that generates a structure such as a tree by mapping the tree growth, shorter than one that just specifies the final state of the tree by defining the position of each cell and the detailed cell variations in the final state?

The next sections discuss these issues.

The apparent entropy increase of an isolated system

As has been mentioned, computations in the real world are driven by the natural laws that define the rules of interactions between the atoms and molecules. One can envisage these species acting as generalised computational gates or computing automata. However, the programme, specifying the interaction between the states in terms of physical laws, implemented on the laboratory reference computer, is defined separately from the bits representing the configurations of the system. If ph represents the programme bit settings that generate the halt state given s0, the shortest description of the initial configuration, which consists of the programme and the so called initial state, is strictly (ph, s0)∗, not just s∗0. The initial state can just as easily be part of the programme, as the computer does not distinguish programme states from the initial state. The programmer only manipulates the natural laws defining the computer behaviour so that, given the programme and the initial state, natural laws underpinning the operation of the computer gates generate the required outcome when the computation halts.

However, if the irreversible laboratory reference UTM is to adequately simulate a natural computation, reversibility must be built in to the laboratory UTM. Otherwise, the algorithm given by Equation 6.2 would imply the entropy increases. Provided the bits introduced by the programme (which includes th, the number of computational steps to the halt configuration) are accounted for, and the history bits are also kept [114], reversibility is maintained. These bits need to be included as part of the algorithmic entropy of the total system. The history bits are in essence the programme instructions that can be used to reverse the computation. As is discussed above, the programme bits that define the halt state are the major contribution to the apparent increase in entropy of an irreversible computation. This understanding is implicit in the discussion by Zurek [113] and Bennett [9, 11].

As the programme unfolds, it adds bits that were originally programme bits to the successors of the initial state, finally terminating at sh. Once the programme bits, the bits in the initial state, and the history of the computation needed to maintain reversibility as part of the halt configuration are taken
into account, the total number of bits is conserved. As Zurek [114] argues, reversibility can only be achieved by filling up the computer memory with historical records [9, 10, 74].

As the observer identifies order in terms of the configuration of the states of the system, the system becomes more disordered as programme bits become part of the bits needed to define the output state and the history of the computation. The example of an isolated hydrogen-oxygen system, which from an observer perspective might be seen as ordered, illustrates the issues. Ignition initiates the computation on the separate molecules of hydrogen and oxygen, adding programme bits to the configurational bits representing the states of the system, generating more water molecules and raising the temperature at each computational step. This is because, from the perspective of a natural world computation, the stored energy in the original gases defining the programme states diffuses through the system, disordering the momentum states at each hypothetical stage. From the point of view of the states of the constituents, the system becomes more disordered as programme bits become part of the bits needed to define the high temperature halt state.

As a consequence, when the programme information is taken into account, and when the information later appears as the history of the computation, the total number of bits does not change, but this information is not usually identified as part of the system. This information, which is embodied in the natural laws describing the interactions between states, is easy to overlook from an observer point of view. From an observer perspective, let sh be the higher temperature halt state of the water system trajectory of equation 6.3. This state includes the quantum states of the hydrogen, oxygen and water molecules together with their rotational and vibrational states. However, when an irreversible UTM simulates this real world system, the history of the computation, corresponding to the programme states embodied in physical laws, is still in the system as part of the full halt state, but is not part of sh, the state of interest to the observer. This is the reason why the history information is not usually identified as part of the system. Where reversibility is maintained, the full halt state is actually (sh, g(s∗0, sh)), where g(s∗0, sh) is the history of the computation needed to maintain reversibility. The initial configuration can be reconstituted from this full halt state by using the computational history, g(s∗0, sh), as a programme to run the computation back in time [9], [11]. While the observer sees sh as the final state, for the overwhelming majority of states, sh and g(s∗0, sh) are generated together and, as outlined in the next few paragraphs, for the majority of states H(sh) = H(sh, g(s∗0, sh)). With these points in mind, the following theorem captures the essentials of this process.

Theorem: In an isolated, reversible, discrete system, the algorithmic entropy that enters the system as the initial configuration and the programme with its halt instruction must be conserved, appearing as the entropy of the computed final state and the entropy embodied in the computational history.

The formal proof of the above theorem recognises that when an irreversible reference UTM is used to map the reversible process generating sh from s0, reversibility must be maintained. This is conveniently recognised by
distinguishing the state space configuration of each halt state sh from the history of the computation g(s∗0, sh), or g for short, at each irreversible step [114]; i.e. the full halt state is (sh, g). The history bits g are the residual programme bits that are not part of the configuration sh, but are needed for reversibility. The initial configuration can be reconstituted from the full halt state on the irreversible UTM by using the computational history, g, to run the computation back in time [9, 11]. As has been mentioned, while the programme ph represents the bit settings that generate a particular halt state given s0, the laboratory UTM implements the physical laws embodied in its gates to generate the output state from the initial configuration that includes ph and the initial state s0. The computation that reversibly generates the halt state on the reference UTM is

U(ph, s0) = (sh, g).

As no entropy is lost to the environment, the process is logically and thermodynamically reversible [10], [9], [114]. In which case entropy is conserved, as H(ph, s0) = H(sh, g). Here H(sh, g) is the entropy of the final configuration of the computation, which includes both sh and the computational history g necessary for reversibility. Furthermore, as there is no alternative path to this configuration (other than the reverse computation from the successor state sh+1), this must constitute the shortest reversible algorithm to generate the halt state, and therefore an equals sign is appropriate in the following equation.

H(ph, s0) = H(sh, g) = H(s0) + H(sh, g|s∗0).    (6.4)

Note that, as the computation must halt for each th, th will need to increase to th+1 to generate the next halt state sh+1. The halt requirement restricts the number of programme bits that need to be added to define the state bits in the halt state.

From an observer point of view, if the computation is to actually halt in a unique state, the reversal instructions need to be isolated from the halt state. In this case, the history or garbage, g, must be eliminated to ensure that the trajectory from s0 to sh is logically and thermodynamically irreversible. This happens automatically when the laboratory irreversible computer is reset to perform a new computation and the energy in unwanted history or programme bits diffuses away as heat1.

However, a normal real world computation in an isolated system is not onlyreversible, but also does not halt. The implications of this are addressed later.

1Huw Price, in his book “Time’s Arrow and Archimedes’ Point” [63], argues that scientists are muddled in their thinking about entropy and reversibility in statistical mechanics. The above approach shows entropy does not increase in an isolated system; rather, the bits specifying the interaction between states migrate to the momentum states, apparently increasing the entropy. The problem raised by Price would seem to be resolved, or at worst shifted elsewhere.

The shortest computational path

This section shows that the trajectory algorithm of the form of Equation 6.2, generated by increasing th for each new halt state, provides the shortest algorithmic description for the overwhelming majority of states. Firstly, as mentioned, the reversible trajectory to give sh, or (sh, g), is the shortest, as there can be no other path to this state. As all the information is accounted for in a reversible process, an equals sign is appropriate in Equation 6.3 and Equation 6.4, showing that there can be no entropy change. But does the entropy in Equation 6.3 determined from the stepping algorithm represent the shortest description of the halt state of a system, or can a shorter algorithm be found that directly outputs the final state sh without generating g? I.e. is H(sh) < H(ph, s0)? The answer is that for the overwhelming majority of states, H(sh) = H(ph, s0), as specified in the following theorem.

Even when the computational history g is eliminated, leaving sh, for the overwhelming majority of states H(sh) = H(sh, g). Only in very rare cases can an algorithm shorter than the trajectory algorithm define sh. In these rare situations, sh must be generated by an irreversible algorithm having a different history from the reversible trajectory algorithm.

Proof: Consider the shortest reversible trajectory through a system's bit space (or state space) in the form of Equation 6.2. Let the halting configuration be sh after th steps, and let the history of the computation needed to maintain reversibility be represented by the string g. Thus, having selected the shortest reversible path that generates sh together with the history g after th steps, it needs to be shown that for the overwhelming majority of states:

H(sh) = H(ph, s0) = H(sh, g) = H(s0) + H(th) + H(Compute next STATE).    (6.5)

There are two situations that need to be considered.

1. When the equals sign is necessary in Equation 6.3 or Equation 6.4. This is valid when the halt state sh has no shorter irreversible trajectory, because sh and g must be generated together by the same programme to map the real world process. However, no shorter programme can generate both sh and g as, being reversible, there can be only one forward computational path to the combined string (sh, g). This situation implies that H(g|s∗h) = 0 from Equation 3.20.

2. Where a specification of sh exists that is shorter than that implied by the above trajectory with the equals sign in Equation 6.4, a specific argument is needed to determine when this might occur. In this case, in contrast to the first situation, H(sh) < H(sh, g), but H(ph, s0) = H(sh, g). I.e. when is a shorter description possible by calculating sh independently of g? This corresponds to one of those rare situations where an ordered state appears in the trajectory as a fluctuation.

Case 1. The following argument shows that for the overwhelming majority of output states, as in point 1 above, H(ph, s0) = H(sh, g) = H(sh)
and a shorter description where H(sh) < H(sh, g) is rare. While a shorter programme may generate sh, no shorter programme can generate both sh and g as, again being reversible, there can be only one path to the combined string (sh, g).

While a shorter programme might generate sh by itself, as argued below this is rare, and the computation and the computational history must be different from those generated by the stepping programme that produced the trajectory. The argument for the dominance of Case 1 rests on the observation that, as there are very few short programmes, only in rare cases is there likely to be a fluctuation where |s∗h| < H(sh, g(s∗0, sh)). This can be seen by noting that, because the physical laws and the initial state underpinning the evolutionary algorithm are highly compressed, the algorithm that describes an evolving system is dominated by the specification of th. I.e.

log2(th) >> |Compute next STATE|.

As a consequence, the bits that define the halt state, th, are the major contribution to the increase in entropy, and the entropy increases with each successive step of the algorithm. As the system evolves, each successive halt state (given by successive values of th) will need to be coded from all the self-delimiting strings representing the integers between 0 and th.

For a given halt state, many of the possible shorter algorithms to generate sh will already have been used to specify earlier states in the trajectory. Furthermore, as there are no more than 2^n binary programmes shorter than n bits, at most 1 in 2^10, i.e. 1 in 1024, strings of length n can be generated by a programme more than 10 bits shorter than n. The fraction of algorithms available to provide a significantly shorter description of sh is extremely low, particularly when the self-delimiting requirement reduces the available number even further. This can be seen by considering how many shorter programmes might be available to define the algorithmic entropy of a mole of a Boltzmann gas. As discussed in section 3.3, where the resolution of the position-momentum space is 1 part in 2^8, something like 273 bits is required to define a typical state. If there is a fluctuation from the typical state, no more than 1 in 1024 fluctuations can be described in fewer than 273 − 10 bits, and shortening the description by 10 bits would hardly be discernible. For the overwhelming majority of outcomes it follows that the stepping algorithm of Equation 6.3 represents the most compressed description of the final state, both for sh and g together and for sh on its own, as there are very few shorter programmes that have not already been used. Therefore one can assume the equals sign in Equation 6.3 for the overwhelming majority of cases.
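The counting behind the "1 in 1024" figure can be checked directly. The following sketch is purely illustrative (the 273-bit state size simply echoes the Boltzmann gas figure quoted above, and the helper name is an assumption, not anything defined in the text); it only counts how few programmes are available below a given length:

    # A minimal sketch of the counting argument: of the 2^n strings describing an
    # n-bit state, fewer than 2^(n-m) can be generated by programmes shorter than
    # n - m bits, so at most a fraction 2^-m of states can be compressed by m bits.
    def fraction_compressible(n: int, m: int) -> float:
        programmes_available = 2 ** (n - m)     # upper bound on programmes shorter than n - m
        states = 2 ** n
        return programmes_available / states    # equals 2^-m

    print(fraction_compressible(273, 10))       # about 1 in 1024 for the 273-bit example
    print(fraction_compressible(100, 10))       # the bound is independent of n

The point the sketch makes is that the bound depends only on the number of bits saved, not on the size of the state description.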

Case 2. However, where a rare event appears as a fluctuation in the trajectory implied by Equation 6.3, H(sh) < H(sh, g). In this situation, the entropy difference between the reversible algorithm and the shortest algorithm for sh is H(sh, g) − H(sh). Information must be discarded from H(sh, g) to define H(sh) without reference to the computational history. This implies H(g|s∗h) is now greater than zero. In this case the shortest programme that generates sh without generating g cannot pass through the majority of states in the evolutionary trajectory, as there can be only one path to most of these. The short
computation that generates sh is not reversible in the same way as the trajectory computation. As a consequence, sh, and the record of the history of the shorter computation that generates sh, must be different from the history generated by the stepping algorithm of equation 6.2. This implies a specific and longer description of g in the stepping algorithm when H(sh) < H(sh, g). When this happens, if all bits are taken into account, while H(sh) is lower, some of the bits in the longer description g will have become new programme bits that store energy in new structures not seen before. I.e. a rare, more ordered configuration might occur when collisions between particles shift state configuration bits into new programme bits and in doing so store energy.

A somewhat similar case is where the number of steps increases in the trajectory algorithm, taking the ergodic system through its equilibrium to a more ordered configuration on the other side of the equilibrium set of states. In this case, past equilibrium, H(sh, g) will continue to grow as H(g) grows. However, as is discussed later in section 6.3, past the equilibrium, sh will start to become more ordered (i.e. a shorter description exists) as extra bits are transferred from the description of sh to the history. In effect the state bits are transferred back to programme bits. The shortest algorithmic description in this case is the algorithm that reverses the direction of the trajectory by going backwards in time from the initial state.

A further point needs to be noted. Where a catalyst provides an alternative reversible trajectory, as more programme steps are required, the history of the process, g′, will be greater. In this case the appropriate H(sh, g′) will be greater than H(sh, g). While the computation with the catalyst will still be reversible, the algorithm that generates both sh and the new history string, g′, will be longer than the original algorithm that generates sh without the catalyst, as the system plus catalyst is larger. While the catalyst must be part of the initial state or initial programme, it is regenerated at the end of the process and can be ejected at the completion of the computation, indicating there is no net entropy increase because of the actions of the catalyst. The algorithmic entropy must be derived from the shortest path to generate sh, not the most efficient path in a time sense.

The equilibrium algorithmic entropy

As the reversible system steps through the allowable states, halting each time th is increased, the length of the algorithm in equation 6.2 depends on H(th), the contribution of th to the algorithmic entropy. As th grows, eventually H(th) dominates the algorithm as, relative to the number of steps in the algorithm, the natural laws are simple. As H(th) ≈ log2(th), the overall algorithm trends towards the log2(th) curve most of the time. But because shorter descriptions of the integer th exist for some values, there are many fluctuations where H(th) drops below this trend, as figure 3.2 in Li and Vitányi [79] indicates. However, when the algorithm approaches an equilibrium state the argument needs to change.

As th grows, the size of the specification of th, given by |th| ≈ log2(th), changes very slowly. Eventually, as th becomes sufficiently large, it approaches Ω, the number of available states. The equilibrium states correspond to the enormous number of states surrounding the midpoint of the ergodic trajectory. In this region one can envisage th = αΩ, where α is less than 1 but approaches 1. In this case, the algorithm is dominated by log2Ω + log2α. Hence the set of equilibrium configurations will have an algorithmic entropy given by,

H(sequil) ≈ log2Ω+ log2α.

For the majority of states in the trajectory, the equation reduces to log2Ω whenever log2α is negligible relative to log2Ω. E.g. for a mole of a Boltzmann gas log2Ω would be something like 273 and for most states log2α will be negligible. In these cases

H(sequil) ≈ log2Ω.

The provisional algorithmic entropy is equivalent to the Boltzmann value, but with different units. Just as happened during the system's evolution, there might be spontaneous fluctuations away from equilibrium as the system moves through the equilibrium set of configurations, even if there are no entropy exchanges with the environment. When this happens, the history of the computation becomes irrelevant. The shortest description is the algorithm that specifies an equilibrium state directly rather than the trajectory. If the number of steps increases, taking an ergodic system past its equilibrium to more ordered configurations, H(sh, g) will continue to grow as H(g) grows, but increasingly more bits will be transferred back to programme states. The algorithmic entropy for states in an ergodic system past equilibrium will depend on the shortest computational path that determines sh by running the computation backwards in time from the original state.

A somewhat analogous situation occurs when the system passes through a set of states sh with the same provisional entropy. Again, H(sh, g) grows as the trajectory steps through each state in the set, but the provisional entropy does not change except when sh specifies a highly ordered fluctuation from the typical state. In this case H(sh, g) can be separated into H(sh) and H(g), as specifying sh provides no information that can be used to specify g more simply.

Is the algorithm that tracks a replicating trajectory shorter than a copying algorithm?

As most living structures are highly ordered and far from even a local equilibrium, the instantaneous configuration can be specified by an algorithm that has fewer bits relative to the description of an equilibrium configuration. For example, a tree can be defined from a replicating algorithm that generates a tree from the DNA instructions in the seed, while accessing external resources during each step of the growth process (analogous to Equation 6.2), and then halts. This happens either because the observer defines the halt requirement,
or because a natural event is fed into the computation to terminate the growth algorithm. Alternatively, a tree can be defined algorithmically by specifying each cell type in the tree and the assembly routine that puts each cell variation in the right position to make a tree at the halt instant. The question is which of these is the shorter.

Specifying the placement of each cell of the tree and its cell type on a case-by-case basis is likely to require a longer algorithm than the one that generates the tree from the genetic code, as far more detailed information is required to define the position and variation of each individual tree cell when the tree is specified in its final state, once all degrees of freedom are accounted for.

If such an algorithm is to specify the final structure, such as a tree, it requires a routine that copies each cell, allowing for the different variations, and places each copy in its appropriate place in the tree structure. For example, if there are M variations of N replicating cells, the particular configuration that needs to be identified from the M^N variations for the copying routine requires at least an extra N log2(M) bits. Furthermore, each replicated cell will need to be positioned in space and will also carry kinetic energy that in principle needs to be specified to exactly define the state. However, from an observer perspective, the copying algorithm might appear shorter when degrees of freedom associated with the evolving algorithm are overlooked.
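As a purely illustrative calculation (the numbers are hypothetical and not taken from any measured organism): with M = 16 distinguishable cell variants and N = 10^6 cells, the copying routine must single out one configuration among 16^(10^6) possibilities, i.e. it needs at least N log2(M) = 4 × 10^6 extra bits, before the positional and momentum degrees of freedom of each cell are even considered.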

On the other hand, the replicating algorithm based on generating the tree from the genetic code specifies how the genes are expressed in the cells in different parts of the tree. The cell variations, and even the position of each cell, are determined by the rules of growth arising from the computing process and do not need to be separately specified. While the DNA or the software generating each cell is the same, gene expression occurs as the replicating algorithm calls different routines depending on the environment of neighbouring cells and external inputs (such as the availability of water and light). Consequently a cell in the leaf is expressed differently from a cell in, say, the root system, but it is produced by the same replicating algorithm.

Perhaps a simpler argument is to see the natural world as a computer, where the programme that generates the tree on the real world computer is analogous to equation 6.3. In which case the algorithmic entropy is determined by the algorithms embodied in the seed, in the external resources needed to grow the seed, and in the final halt state that determines how much the tree should grow. As was argued above in section 6.3, the overwhelming majority of outcomes generated by the growth trajectory will be the shortest possible. Hence, once some growth has taken place, deriving the algorithmic entropy from the start condition (the initial state of the seed, the resource inputs and the programme) will in almost all cases be shorter than attempting to derive a shorter algorithm that specifies the final state of the tree.

Overall, the above arguments suggest that, for the great majority of cases describing real physical situations, the replicating algorithm, while appearing longer in the first instance, is likely to be the shortest when all is considered. In principle a growing tree is a reversible system, but throughout the growth process waste exits the system. As the waste drifts away, reversibility becomes
highly unlikely in the life of the universe. This leads to the conclusion that variation or diversity in structures in a system increases the algorithmic entropy. From the previous argument, it would appear to be algorithmically more efficient to generate any variation through the software (e.g. variation in DNA) rather than independently specifying a variation of the object at the structural level, as in the tree example. As Devine [47] has pointed out, it is plausible to conjecture that building the variation as switches to be interrogated by the software decreases the entropy in a way that tends to compensate for the entropy cost of building diversity into near identical structures. With these understandings it becomes possible to track algorithmic entropy flows, and therefore thermodynamic entropy flows, in real systems, as will be discussed in section 9.8.


Chapter 7

The minimum description length approach

The Minimum Description Length approach (MDL) was developed by Rissanen[87, 91] (see also Barron et al [5]) as a practical method to find the best modelto describe a sequence of data points. There are basically two forms of MDL.

The first, a true algorithmic approach, is known as ideal MDL [109]. It is the theoretical approach that seeks the shortest description of a data set by a two-part algorithm. The first algorithm captures the regularity of the data with a hypothesis or model. The second part is the algorithm that captures the deviation of the actual data from the regularity, given the model. If the model is too simple it may have a short description, but this will be more than offset by the longer description of the deviations from the regularity implied by the model. An optimum clearly exists. The algorithmic or ideal version is an extension of the AMSS or provisional entropy approaches described in section 4.4. However, while the ideal MDL approach finds the algorithm that best fits the data, it does not generate the data, as required by algorithmic information theory. Nevertheless, once a short decoding routine is appended, the data can be generated from the optimum coding algorithm. In which case, the length of the MDL algorithm with this decoding routine added is virtually the same as the provisional entropy. In the ideal form, the model captures the regularity of the set, and the string of interest is identified given the set. Like the provisional entropy, the ideal approach also suffers from the fact that the algorithmic description is non-computable and there is no systematic procedure to decide whether the understanding embodied in the hypothesis or model is the best possible. Different models can be investigated, but the process is ad hoc.

Practical MDL, the second approach, addresses the non-computability problem by replacing the idea of algorithmic complexity with stochastic complexity. I.e. rather than finding the length of the shortest actual algorithm, the stochastic complexity seeks the minimum code length based on Shannon's Coding Theorem. As is outlined later, practical MDL provides a tool for selecting
the best model from a class of models by a systematic process. If the class is sufficiently broad, such as the class of polynomials, or the class of Markov models, the process identifies which model in the class describes the data with the fewest number of bits. Inevitably, the optimality is relative to the class or classes chosen. If this were not so, the approach could be used to determine all the laws of physics.

In what follows, the ideal or algorithmic MDL and the general principles of MDL will be discussed. Following this, the practical MDL approach will be briefly outlined as a model selection process. Once a model is established, it can be used to define the appropriate set, just as is done for the provisional entropy approach. However, as the practical approach to MDL is a separate discipline in its own right, it is not the focus of this article. Rather, here the major focus will be on the algorithmic version and its identification with the provisional entropy. But first the discussion on optimal coding (3.3) needs to be revisited.

Much of what is to be discussed in this section is the same as that discussed in section 4.5, which generalised the algorithmic approach to model development. This generalised approach is virtually identical to the Minimum Description Length. Take a string y(n) representing the sequence (y1, y2, y3 . . . yn) that is the output data of a computational process given the input data string x(n). The ideal MDL approach provides a more formalised method for capturing data that links Algorithmic Information Theory with model selection. As was mentioned in section 4.5, a model, in effect, is a way of compressing data. A good model is one that has a maximally compressed description of the data. Consider as an example planetary motion under Newton's laws. The regularity is in the algorithm that implements the laws. This is the model. The variation (which may be due to the limits of the model, hidden effects, noise, or experimental error) of each data point yi will appear as random deviations from the model. The optimum compression of a particular data point is given by the provisional entropy or the Algorithmic Minimum Sufficient Statistic, as was discussed in section 4.2. The model specifies the characteristic of the set of valid outcomes, and the variation or deviation from the ideal is captured by the code that identifies the particular string in the set. The deviation will appear random, because if it were not, the model could be altered to give a better fit. The ideal MDL approach selects the Algorithmic Minimum Sufficient Statistic (AMSS) or the best estimate of the provisional entropy for each data point. As is discussed below, all typical strings in the set defined by the model will have the same description length. A few strings which are not typical will have shorter descriptions. The following paragraphs make this clear using an alternative derivation of MDL based on Bayes' rule. This is helpful because of the insights the derivation provides, and because of the links into the practical version of MDL. Li and Vitányi [79] outline the principles.

Returning to Bayes' rule outlined in section 3.8, if a set of hypotheses or models can be expressed as a string, the probability of the particular hypothesis
or model Hi from a set of alternatives given the data D satisfies:

P (Hi|D) = P (D|Hi)P (Hi)/P (D). (7.1)

The hypothesis or, in our case, the model may be articulated as a physical law; a simple model (such as a straight line fit to the data); or be from a class of models such as the polynomial, Bernoulli, or Markov models. The approach outlined below does not establish the parameters of the possible models; rather, given the best fit obtained from a number of possible models, the approach establishes which of these best-fits is preferred. For example, a trial fit of the data could use a straight line based on a standard fitting process such as least squares. Then a second degree polynomial could be trialled, and so on. This process establishes the optimum parameters for each potential model. Given these models with their parameters, the MDL approach then establishes which of the possible fits provides the shortest algorithmic description of the data. Because higher degree polynomials require more parameters, they require longer descriptions. Such a polynomial will only be preferred if the deviations from the model are sufficiently reduced to compensate for the longer description. The ideal MDL approach considers only hypotheses that can be expressed as finite sets of binary strings of finite length [109]. In which case the ideal MDL approach can be derived by taking the negative logarithm of Bayes' equation (7.1) above, to get

−log2P (Hi|D) = −log2P (D|Hi)− log2P (Hi) + log2P (D). (7.2)

The Bayesian approach selects the hypothesis that maximises the probability P(Hi|D), i.e. the hypothesis that, given the data, has the highest probability. This is equivalent to choosing the hypothesis that minimises the right hand side of equation (7.2) and, in so doing, provides the shortest description of the data. However, log2P(D) is independent of the hypothesis or model and so is irrelevant to the minimisation process and can be ignored. Where, as is usually the case, the prior probability in the above equation is not known, the universal distribution m(Hi) can be used as the prior (see section 3.8) to replace P(Hi). The universal distribution is the best estimate of the unknown probability. Furthermore, provided that the restrictions outlined below are satisfied, m(D|Hi) can also replace P(D|Hi) and the equation to be minimised becomes:

−log2m(D|Hi)− log2m(Hi).

Because of the algorithmic version of Shannon's coding theorem (section 3.8), the universal distribution m(.) is equal to 2^−H(.). Hence H(Hi), the shortest algorithmic description of the model hypothesis Hi, can replace −log2m(Hi) in the above equation. This leads to the equivalent expression of the above equation in algorithmic entropy terms. I.e. to within the O(1) constant, the optimum model minimises:

Halgo(Hi) +Halgo(D|Hi).

This is equivalent to finding the provisional entropy of the hypothesis or modelwhere

Hprov(D) = Halgo(Hi) +Halgo(D|Hi).

I.e. the preferred hypothesis is the one that minimises the provisional entropy, or selects the Algorithmic Minimum Sufficient Statistic, as given by equations 4.8 and 4.6, generalised to include the probability distribution as outlined in section 4.5. If there is more than one possibility with the same minimisation, the hypothesis Hi chosen is the one with the shortest H(Hi). Effectively the MDL argument is that the shortest algorithmic description, given the information available, is the preferred hypothesis.

For a finite set of hypotheses, the Bayesian approach coincides with ideal MDL using m(Hi) as the prior probability, provided that the substitution of H(D|Hi) for −log2m(D|Hi) is valid. This substitution is only valid where the data string, represented by D, is a typical member of the data set implied by the hypothesis. To be typical, the string must be random with respect to the set. This can be illustrated by considering a model that describes a particular string as the outcome of a Bernoulli process where the probability of a ‘1’ is p and that of a ‘0’ is 1−p. If the hypothesis is that p = 0.75, the approach will only work if the outcome is a typical string in the set of strings satisfying the hypothesis. A string of 75 1’s followed by 25 0’s is not a typical string in the set of possible strings. The particular string mentioned above would only be typical (or random) for a different model. In general, replacing −log2m(D|Hi) by H(D|Hi) is only valid when the data string is random with respect to the hypothesis. If the data string is not typical, just as in the AMSS case, a new hypothesis or model is needed. When the right hypothesis is found, and each member of the set is equally likely with respect to the new hypothesis, then −log2[P(D|Hi)] = −log2(1/Ω) = log2Ω, where Ω is the number of data strings in the restricted set. Similarly, in algorithmic terms, H(D|Hi) = log2Ω. Also, like the AMSS and the provisional entropy approaches, the minimum description length codes a string in a two-part code: one part that defines the model and the other part that specifies the particular string, given the model. Alternatively, this model can be seen in terms of the first part describing the theory, and the second describing how, given the theory, the data with deviations can be exactly fitted. Strictly, while MDL identifies the optimal hypothesis or model, it only becomes an algorithmic measure when a short decoding routine is included to generate the string from the model's parameters.

The MDL approach highlights again the fact that the most compressed description of the set of observations y(n) = (y1, y2, y3, . . . yn) involves finding the model that best fits the set of observations in a way that minimises the deviations from the model of each data point yi in the data string. The ideal MDL approach is identical to the argument outlined in more detail in section 4.5. For example, a straight line model used for an MDL approach will attempt to explain a noisy set of data by the equation yi = axi + b + noise, or yi = y(xi) + noise. Here the hypothesis is the straight line model y(xi), but it also includes a probability distribution to cover the noise or variation. I.e. the
noise or deviation from the model for a particular data point xi is yi − y(xi). The minimum description approach minimises the length of the algorithmic description of the model plus any deviation due to the noise. Such an approach would require 2d bits to specify the constants a and b to a precision d, and parameters c and σ to specify, to the same precision, the probability distribution that generates P(yi − y(xi)). The deviations from the model are specified using a code based on Shannon's Coding Theorem, as outlined in section 4.5. As is outlined in more detail in Section 4.5, in the linear case, where y(x) is a straight line, the best minimum description for a particular string, given the input xi, becomes the provisional entropy:

Hprov(yi) = H[(axi + b), P ] + |code[yi − y(xi)]|+ |decoding routine|

Here |code(yi − y(xi))| = −log2P(yi − y(xi)), allowing for the need to round the code length up to an integer. Once MDL has selected the optimum model, or the equivalent set, and the typicality requirements have been met, the model gives the provisional entropy for a particular string. However, there may always be a shorter description waiting, and the shorter description could point to a better model. A more practical process that overcomes the non-computability issues in ideal MDL, in order to determine which model is optimum given the data, is discussed in the next section.
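To make the two-part bookkeeping concrete, the following sketch computes the description length of a small data set under a straight-line model. It is illustrative only: the Gaussian noise model, the 8-bit parameter precision, the grid size δ and the function names are assumptions chosen for the example, not anything specified in the text.

    import numpy as np
    from math import ceil, log2

    def two_part_description_bits(x, y, precision_bits=8, sigma=1.0, delta=0.01):
        # Part 1: the model - slope a, intercept b and noise width sigma,
        # each stated to precision_bits bits, standing in for the model term.
        a, b = np.polyfit(x, y, 1)
        model_bits = 3 * precision_bits
        # Part 2: a Shannon code for each deviation y_i - y(x_i), using a
        # Gaussian of width sigma discretised on a grid of size delta.
        deviations = y - (a * x + b)
        probs = np.exp(-deviations**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi)) * delta
        data_bits = sum(ceil(-log2(p)) for p in probs)
        return model_bits + data_bits

    rng = np.random.default_rng(1)
    x = np.arange(20.0)
    y = 0.7 * x + 3.0 + rng.normal(0.0, 1.0, 20)
    print(two_part_description_bits(x, y))

A poorer model (for example, a constant) would shrink the first term slightly but inflate the deviation term, which is the trade-off the provisional entropy expression above captures.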

Practical MDL

The practical MDL approach developed by Rissanen bypasses the computability problem inherent in the ideal MDL approach to model selection by defining the stochastic complexity to replace the algorithmic complexity [87, 90, 91]. The stochastic complexity is computable. It corresponds to the minimum description of the observed data string using optimal codes based on probabilities, rather than on an algorithm (although at times the most efficient code might well be virtually the same as the shortest algorithm). Thus the stochastic complexity is defined as the shortest description of the data, relative to a class of probabilistic models. Stochastic complexity was inspired by the AIT approach and Solomonoff's approach to induction (see section 3.8). Reviews can be found in Grünwald [62] and Hansen and Yu [64].

Shannon's Coding Theorem, which was discussed in section 3.3, shows that an optimal code can be found for a countable set of objects denoted by yi, distributed according to a probability P(yi). These objects can be strings and, in this section, these strings are considered to be derived from particular models. As section 3.3 points out, Shannon's noiseless Coding Theorem (Equation (3.18)) demonstrates that optimum codes from a prefix-free set can be chosen with lengths virtually equal to the negative logarithm of the probability, rounded up to an integer; namely:

|code(yi)| = ⌈−log2(P(yi))⌉.

As −log2(P(yi)) is usually not an integer, the code length will be the next greatest integer. The convention is to denote the code length by −log2P with
the understanding that it is to be an integer. One further point is that where the probability model is expressed as a probability density, the variables yi are taken to a chosen level of precision, such as four significant figures. In this case the code length −log2P(yi) for each yi is replaced by −log2[P(yi)δ], where δ represents the grid size determined by the resolution. As δ is constant throughout the MDL optimisation procedures, it can be ignored.
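As a small illustration of this rounding convention (the four-symbol alphabet and its probabilities are invented for the example), the code lengths and the accompanying Kraft inequality check can be computed directly:

    from math import ceil, log2

    # Hypothetical probabilities for a four-symbol alphabet.
    P = {'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}

    # Shannon code length: the negative log probability, rounded up to an integer.
    code_lengths = {s: ceil(-log2(p)) for s, p in P.items()}
    print(code_lengths)                                    # {'a': 1, 'b': 2, 'c': 3, 'd': 4}

    # A prefix-free code with these lengths exists because the Kraft sum is at most 1.
    print(sum(2 ** -length for length in code_lengths.values()))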

Again, the principle behind the approach is that the better the model of the observed data, the more compressed the description will be. This suggests that the best model or models will be those that provide the shortest description. However, the question arises as to whether the process has advantages over other statistical methodologies that select hypotheses or models. In general, the practical MDL approach avoids over-fitting, where there are too many parameters in the model relative to the data, as well as under-fitting, where the model is too simplistic. When there are too many parameters, the error term is small, but the overall description, including the model description, is longer. Under-fitting the data gives rise to excessive variation in the data from the simplistic model's values, which requires a longer description of the deviation. For example, a data string of 20 points could be fitted by a simple straight line, but the deviations might require a large contribution from the −log(P) term. However, a model with 20 parameters could give an exact fit, but the model description might be too long. The best fit lies between these. In a practical sense, MDL satisfies Occam's razor.

The approach has some important differences from traditional data fitting, as there is no assumption that the process produces the true model or true parameters underlying the data. Rissanen points out [90] that the optimal model is not an approximation to anything; it just fits the data. The practical MDL approach can be derived from Bayes' theorem as outlined previously for ideal MDL. Alternatively, it can be argued that when the data is maximally compressed, all the useful information has been extracted. Minimising the stochastic complexity becomes, in effect, ensuring that the data is random relative to the class of models and that all pattern has been captured in the model, in a manner similar to the AMSS approach.

Given a data set, the practical MDL approach seeks to find which member of a class of models provides the shortest description of the data. The MDL model class under consideration is usually expressed in terms of the parameters that define the class. The class could be a set of linear regressions where the data y is a linear function of several variables, say x1 to xn. In the non-linear regression case, a similar set of parameters x1 to xk+1, representing the coefficients of a polynomial of degree k, can be used. In this case, the model can be represented in terms of the parameter set x and the model label k. A general Markov model is described by its order k, representing the length of the history required to define the future, while in this class of models the parameter set becomes a set of probabilities p(k). A Bernoulli process is the simplest Markov model, with k = 1 and a probability p. (Note here that P(y) represents the probability of string y in the set of strings, while p is the probability of the next character in string y.) The non-predictive version [87, 88] shows that in certain
circumstances the minimum description of a string of length n reduces to:

−log2P (y|p(k), k) + (k/2)log2n+O(log2n).

The second term, derived from the stochastic complexity, parameterises the model class, while the first term is a probability-based code that codes for the data given the model specified by the parameters p(k) and k. Consider a data string generated by a Bernoulli process, with r 1's in the string of length n, and with k = 1. The shortest description for large n depends on nh(p) + (log2n)/2, where h is the entropy rate, the entropy per symbol [89]. The maximum likelihood estimate of the probability gives p = r/n. However, if a higher order Markov model is to be considered as a better explanation of the data, k will increase, adding (k/2−1)log2(n/2) bits. If this increase is more than offset by a decrease in the code length derived from the probability term, the latter model will be preferred. The optimum description is the one that minimises the total length over the whole class of models. The practical MDL approach has evolved significantly since the early work of Rissanen [87, 90, 91]. Later versions use a normalised maximum likelihood model to provide better descriptions of the data.
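The following sketch illustrates this trade-off numerically. It is only an illustration of the two-part bookkeeping: the data string is invented, and the penalty is taken as (number of free parameters/2)·log2 n, which is an assumption about how to apply the expression above rather than anything prescribed in the text.

    from math import log2

    def bernoulli_bits(s):
        # Data cost -log2 P(s|p) at the maximum likelihood p = r/n,
        # plus (1/2) log2 n for the single parameter.
        n, r = len(s), s.count('1')
        p = r / n
        data = -sum(log2(p) if c == '1' else log2(1 - p) for c in s)
        return data + 0.5 * log2(n)

    def markov1_bits(s):
        # Two conditional probabilities P(1|0) and P(1|1); penalty (2/2) log2 n.
        n = len(s)
        counts = {(a, b): 0 for a in '01' for b in '01'}
        for prev, cur in zip(s, s[1:]):
            counts[(prev, cur)] += 1
        data = 0.0
        for prev in '01':
            total = counts[(prev, '0')] + counts[(prev, '1')]
            for cur in '01':
                c = counts[(prev, cur)]
                if c:
                    data -= c * log2(c / total)
        return data + log2(n)

    s = '01' * 14 + '0110'      # a hypothetical string with strong alternation
    print(bernoulli_bits(s), markov1_bits(s))

For this strongly alternating string, the larger Markov penalty is more than repaid by the shorter probability code, so the higher order model gives the smaller total description length.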

Maximum Entropy formulation

The “maximum entropy” approach developed by Jaynes is an attempt to make the maximum use of information when modelling a system [69, 70, 71]. The approach, which has at its heart Laplace's principle of indifference, seems to work despite the scepticism of the traditional frequentist understanding of probability. Interestingly, as the following outlines, the Minimum Description Length approach can be used to justify the maximum entropy approach of Jaynes [88, 79]. In other words, minimising the description length is equivalent to maximising the Shannon entropy.

In an actual experimental situation, the true probability pi of a particular outcome i may not be known. If there are n independent trials and k outcomes, and the outcome i occurs ni times, the probability of outcome i, given by pi, can be approximated by ni/n. However, while ni may not actually be known, any information observed determines the constraints on the possible spread of outcomes. For example, one constraint should be that the total of the approximations to the probabilities should be 1, i.e. Σni/n = 1. Similarly, for an energy distribution over a number of outcomes, each with energy Ei, the average observed energy per trial, E, constrains the outcomes through ΣEini/n = E. These constraints limit the possible values of the probabilities for each outcome. In which case, the Maximum Entropy principle can identify the model that best fits the constraints by determining the probability distribution pi that maximises the entropy. If the Shannon entropy

H(p1, . . . , pk) = −Σi pi log2 pi. (7.3)

has a maximum for a particular set of probabilities pi satisfying the constraints, that set of probabilities is the preferred distribution.

In the Maximum Entropy formulation, Lagrangian multipliers are used to capture the information embodied in the constraints and to find the values of pi that maximise the entropy. This is effectively the approach used in the derivation of Maxwell-Boltzmann statistics. Li and Vitányi [79] give the example of the so-called Brandeis dice problem (see also Jaynes [70]), where the average outcome of the throw of a dice indicates bias. The observed bias constrains the allowable probabilities. The set of probabilities pi that maximises the entropy, given the constraints, is then found. The following outlines why minimising −log2P(x|p) is equivalent to maximising the Shannon entropy.

In general, the linearly independent constraints can be represented by a series of equations of the form Σjai,jnj/n = xi, where xi represents the observed value of a physical quantity (Rissanen [88]). There will be a set of possible distributions p = (p1, p2 . . . pk), each of which satisfies the above constraints, and each will have an associated probability as a hypothesis for explaining the constraints. (Note that here the probability P of the hypothesis, representing a particular model, is to be distinguished from a particular set of probabilities p that satisfy the constraints x.) Any allowable probability distribution should fit the data, hence P(x|p) = 1. In which case, the ideal code length associated with the model ought to be −log2P(x|p) = 0. Thus the best set of probabilities is the set that minimises −log2P(x|p). For large sequences of data, this is equivalent to maximising n!/(n1! . . . nk!). This, given Stirling's approximation, is equivalent to maximising the entropy of equation (7.3), demonstrating the validity of Jaynes' approach.
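As an illustrative sketch of this kind of calculation (the observed mean of 4.5, the bracketing interval and the function name are assumptions chosen for the example, not values from the text), the maximum-entropy distribution for a die with a constrained mean is exponential in the constrained quantity, and the Lagrange multiplier can be found numerically:

    import numpy as np

    def maxent_dice(mean_target, faces=6):
        """Maximum-entropy probabilities for die faces 1..faces with a fixed mean."""
        i = np.arange(1, faces + 1)

        def mean_for(lam):
            w = np.exp(lam * i)          # the maxent solution is exponential in the constraint
            p = w / w.sum()
            return (p * i).sum(), p

        lo, hi = -5.0, 5.0               # bracket for the Lagrange multiplier
        for _ in range(100):             # bisection on the multiplier
            mid = 0.5 * (lo + hi)
            m, p = mean_for(mid)
            if m < mean_target:
                lo = mid
            else:
                hi = mid
        return p

    p = maxent_dice(4.5)                 # a die whose observed average throw is 4.5
    print(np.round(p, 4), (p * np.arange(1, 7)).sum())

For a mean of 3.5 the same routine returns the uniform distribution, which is Laplace's principle of indifference recovered as the unconstrained maximum-entropy solution.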

Interestingly too, Jaynes [71], Appendix A argues that Kolmogorov’s moregeneral probability approach is consistent with his own.

Chapter 8

The non-typical string and randomness

8.1 Outline on perspectives on randomness

Mathematicians have had great difficulty in defining randomness in a robust manner. For example, Von Mises [111] attempted to define a randomness test for an infinite sequence by requiring that relative frequencies converge for all subsequences. While intuitively this would seem to be reasonable, mathematicians have identified weaknesses with the approach. However, Algorithmic Information Theory provides a number of different perspectives on randomness that ultimately lead to a more robust approach. From an AIT point of view, most strings are random in the sense that they are not compressible. However, while the sequence ‘111 . . . 11’ of length 100 is just as likely to occur by chance as any other string of the same length, we do not see it as random. The reason is that ‘111 . . . 11’ is compressible and is therefore not a typical member of the set of all strings of length 100.

Typicality is related to the equilibrium concept in statistical thermodynamics. The different possible configurations of a gas in phase space (position-momentum space) can be represented by a series of 1's and 0's depending on whether each cell in phase space is occupied or not (see section 5.2). The overwhelming majority of strings in an equilibrium configuration are typical, as they have no distinguishing features. However, some relatively rare strings are not typical and can be algorithmically compressed. As these can be described by a shorter algorithm than that describing a typical string, they are said to be ordered. Such an example is the configuration ‘111 . . . 11100000 . . . 000’, represented by a string of repeated ones and repeated zeros. While such a highly ordered string can occur by chance, it is very unlikely for this to happen.

Martin-Löf, a colleague of Kolmogorov, recognised that randomness is to do with typicality and the idea of incompressibility. He developed the randomness test known by his name. Chaitin developed the equivalent concept of m-randomness, which is probably simpler to grasp. The following outlines the
key AIT perspectives on randomness. However, as the different perspectives can be a bit overwhelming, they are summarised first and will be discussed in more detail in the following sections.

• Most strings are algorithmically random: as is discussed below, few can be algorithmically compressed by more than, say, 10%.

• Kolmogorov developed a measure he called the “deficiency in randomness” of a string; i.e. the more ordered a string is, the more it is deficient in randomness.

• Martin-Löf [83] defined a universal P-test for randomness that is as good as every conceivable valid randomness test. If a string is deemed random at a particular level by a universal test, no valid test of randomness can do any better. Randomness deficiency defines a test for randomness that is a Martin-Löf P-test, and is therefore universal.

• Chaitin's measure of m-randomness, where m is the number of characters by which the string can be compressed algorithmically, is equivalent to a Martin-Löf universal test of randomness, allowing for the fact that in the Chaitin case self-delimiting codes (i.e. codes from a prefix-free set) are used.

• Another alternative test is the sum P-test of Gács [57]. The idea behind this is that a judicious betting process can turn a profit if the string is claimed to be random but is actually ordered. The outcome of the sum P-test is virtually identical to the Martin-Löf P-test based on deficiency in randomness, except that again the sum P-test requires the computation to be restricted to a prefix-free set of routines.

• Algorithmic Information Theory makes considerable use of universal concepts, i.e. those that dominate all other members in the class. In this sense the concept of a universal test for randomness is analogous to the concept of a universal computer (which additively dominates) and the concept of a universal semi-measure (which multiplicatively dominates).

Before looking at the different perspectives in detail, the concept of randomness deficiency will first be outlined. AIT recognises that most real numbers of a given length are algorithmically random. This can be seen as there are 2^n − 1 binary strings with lengths less than a given n. In simple terms this means that there are just under 2^n strings shorter than n. If one wishes to find the number of these strings that can be generated by an algorithm shorter than n − m, there are only 2^(n−m) shorter algorithms available. For example, if n = 100 and m = 10, there are 1.27 × 10^30 strings of length 100 but only 1.24 × 10^27 strings of length less than 90. Hence at most only 1 in 1024 strings of length 100 can be algorithmically compressed by more than 10%, as there are too few shorter strings available to provide the compressed description. In practice there will be even fewer strings available, as many shorter possibilities will already have been used. For example, some of these short strings will not
be available as they are a compressed representation of strings longer than n; others will not be available as they themselves can be compressed further. As most strings cannot be significantly compressed, the algorithmic entropy of a string x of length n is most likely to be close to n + |n|; i.e. the string's actual length plus the size of the code for its length. Those strings that cannot be compressed to less than n − m are called “m-random” (see for example Chaitin [26]).

Given the uniform distribution, which corresponds to the situation where each outcome is equally likely, Kolmogorov's randomness deficiency provides a measure of how non-random a string is. The less random the string, the greater is its randomness deficiency. For a string of length n the randomness deficiency d(x|n) is defined by

d(x|n) = n− C(x|n). (8.1)

For most strings of a given length, C(x|n) = n, as most cannot be generated by an algorithm shorter than n. In which case d(x|n) = 0 and such strings are deemed random. The greater d(x|n) is for a particular string x, the less random the string is, or, in Kolmogorov's terms, the greater is its deficiency in randomness. A highly ordered string, such as a string of n repeated ones, has d(x|n) = n as, in this case, C(1^n|n) = O(1).

This definition uses C(x|n), the plain algorithmic complexity of string x given its length |x| = n. While the definition could be in terms of H(x|n), using self-delimiting coding, the examples discussed below will use simple coding, as the discussion using C(x|n) as the complexity measure is more straightforward. However, as is discussed later, the outcome of a randomness test known as the sum P-test, which uses self-delimiting coding, or coding from a prefix-free set, will differ slightly from the P-test that uses plain coding.

Returning to the concept of deficiency in randomness, where d(x|n) is large, and close to |x|, the string is highly non-random. Such a string is non-typical as its algorithmic complexity is much less than its length. Hence the label 'deficiency in randomness', as the low value of C(x|n) indicates it can be significantly compressed. For example, the string '11 . . . 1' of one hundred 1's can be compressed to about log2 100 bits, i.e. C(x) = log2 100 + O(1) (equation (3.1)). As C(x|100) = O(1), its randomness deficiency given its length is 100 − O(1); i.e. close to 100. Clearly 111 . . . 1 is a non-typical string in the set of strings of length n = 100. This approach resolves the intuitive problem that, while tossing a coin and obtaining one hundred heads in a row is just as likely as any other outcome, it is somehow completely unexpected.
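Algorithmic complexity itself is not computable, but a general-purpose compressor gives a crude upper bound that illustrates the deficiency-in-randomness idea. In the sketch below zlib is merely a stand-in for the uncomputable C(x|n), so the figures are indicative only: the ordered string shows a deficiency close to its length, while the pseudo-random string shows essentially none (compressor overheads can even make the proxy deficiency slightly negative).

```python
import random
import zlib

def compressed_bits(bit_string: str) -> int:
    """Bits in the zlib-compressed string: a crude upper bound on, and
    stand-in for, the uncomputable algorithmic complexity C(x|n)."""
    return 8 * len(zlib.compress(bit_string.encode()))

n = 100_000
ordered = "1" * n                                            # the string 111...1
random_ish = "".join(random.choice("01") for _ in range(n))  # a typical string

for label, s in (("ordered", ordered), ("random ", random_ish)):
    c = compressed_bits(s)
    # each '0'/'1' character carries one bit, so the deficiency proxy is n - c
    print(f"{label}: proxy C ~ {c} bits, deficiency ~ {n - c} bits")
```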

Another way of viewing this is to divide a set of outcomes into two sets, where the typical set contains those strings that can be compressed very little and the non-typical set those elements that can be significantly compressed. Most outcomes will fall into the typical set containing the more random outcomes. A significantly compressible, or non-typical, string will only occur rarely in a coin toss (unless the coin is biased). As has been mentioned, this is analogous to the argument used in statistical thermodynamics that equilibrium (i.e. random) configurations are overwhelmingly the most likely. While an ordered configuration from the equilibrium set can be observed in principle, this will happen only as a rare fluctuation. It is seen that randomness deficiency is also a measure of how distant a configuration in the natural world is from a random or equilibrium configuration. When this difference uses H(s) rather than C(s|S), it measures the degree of order in a natural system in terms of its distance from equilibrium (sections 6.2 and 9). Chaitin's [26] degree of organisation is also a closely related measure: it is the difference in algorithmic entropy between a configuration that at the particular small scale shows no structure, and the configuration of interest at a larger scale. This is effectively the degree of order, where order is related to the scale of the observation.

A further point to be noted now, but which will later be seen to be significant, is to look at the fraction of strings that can be compressed a given amount. Consider a string x of length n generated by an algorithm of length less than or equal to n − m. It can be compressed to at most n − m bits, in which case C(x|n) ≤ n − m and the randomness deficiency d(x|n) ≥ m. The fraction, at most 2^(−m+1), of strings with d(x|n) ≥ m is small. E.g. for n = 100 and m = 10, at most 1 string in 512 can have d(x|n) ≥ 10. It can be seen that m can be used as a measure of randomness.

The previous discussion involved defining randomness for strings of a given length. This implies the uniform distribution. However, the measure of randomness deficiency can be extended to a string x in a more general set of strings S. The notation used here will be different from that commonly used for the number of elements in a set S. This is often denoted by |S|. However, in this book the vertical lines are taken to refer to the length of the algorithmic specification. Hence, to avoid confusion, ΩS will be used to denote the number of members of the set. For large sets, most strings will be typical and will be represented by a code of length close to log2 ΩS, i.e. the algorithmic entropy of a typical string is close to the Shannon entropy of the set.

The randomness deficiency for the string x in a set S is given by

d(x|S) = log2ΩS − C(x|S). (8.2)

If the string is ordered and can be compressed, its randomness deficiency will be high as it is a non-typical member of the set. Consider the set S of all strings of length n with r 1's and (n − r) 0's. An example is the set of sequences of length 100 with 80 1's and 20 zeros, which will be called the 80:20 set. In this case the set has ΩS = 100!/(80!20!) members and the typical string can be compressed to log2 ΩS ≈ 69 bits. Thus any member of the 80:20 set can be coded in a way that compresses the representation to no more than 69 characters, in contrast to a truly random string which would require 100 characters. So while the strings in the 80:20 set are ordered relative to all strings, the typical string in the set has C(x|S) ≈ 69, which from equation (8.2) gives d(x|S) ≈ 0.¹ However, there are non-typical strings in this set that can be compressed further. They can be deemed "deficient in randomness" relative to the typical string in the 80:20 set, where the typical string is compressed to 69 characters.

¹For large strings Stirling's approximation can be used to show that log2 ΩS = nh, where h is what is called the entropy rate. The entropy rate is the entropy per symbol for a string produced by a sequence of 0s and 1s where, for example, the probability of a 0 being generated is p and that of a 1 is 1 − p. In this case the probability p would be 0.2 and h = −p log2 p − (1 − p) log2(1 − p) = 0.72. When multiplied by 100, the length of the string, this gives 72 rather than 69. For Stirling's approximation to be valid the string needs to be longer.

A non-typical member is the string consisting of 80 1's followed by 20 0's, i.e. 111 . . . 11000 . . . 00. The algorithm that generates this string would be of the form 'PRINT 1, (100 − 20) times; PRINT 0, 20 times'. As n = 100 is given as part of the definition of the set, only the code for 20 is needed to generate this string, as 80 = 100 − 20 can be calculated. As log2 20 = 4.32, 5 bits are needed to code 20 and C(x|S) ≈ 5 + O(1). Here the O(1) covers the specification of 0 and 1, and the calculation of 100 − 20. The randomness deficiency of the ordered string relative to the typical 80:20 string will be d(x|S) ≈ 69 − 5 − O(1); i.e. close to 64. It can be seen from equation (8.2) that d(x|S) for the set S of all 80:20 strings of length 100 will range from around 0 for random strings, where C(x|S) ≈ 69, to log2 ΩS − O(1) for a highly ordered 80:20 string.
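The figures quoted for the 80:20 set can be reproduced directly. This is a sketch of the counting only; the O(1) overheads of an actual machine are ignored.

```python
from math import comb, log2

n, ones = 100, 80
omega_S = comb(n, ones)            # number of members of the 80:20 set
typical = log2(omega_S)            # code length of a typical member
print(f"Omega_S = {omega_S:.3e}, log2(Omega_S) = {typical:.1f} bits")   # ~ 68.9

# The ordered member 1^80 0^20 only needs the count 20 to be specified
# (n = 100 is given with the set): about log2(20) ~ 4.3, i.e. 5 bits, plus O(1).
ordered_code = 5
print(f"ordered member: ~ {ordered_code} bits, "
      f"deficiency ~ {typical - ordered_code:.0f} bits")
```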

An extension is where the deficiency in randomness is defined relative to a recursively enumerable probability distribution P. In which case a typical member of the set will be coded by its probability in the set. According to Shannon's noiseless coding theorem, the optimum code length for such a string will be the integer value of −log2 P(x). In this case a typical member is one where the length of the algorithmic description fits the logarithm of its probability in the set. Most strings can be compressed to a code of length −log2 P(x). However, an ordered string, or one that is non-typical relative to the probability, can be compressed to fewer than −log2 P(x) bits, and the deficiency in randomness becomes d(x|P) = −log2 P(x) − C(x|P), which is consistent with the earlier definition of equation (8.1). For the simple case of the uniform distribution over all strings of length n, each of the 2^n strings is equally likely. Hence P(x) = 2^−n and d(x|P) = n − C(x|n).
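A small illustration of the coding argument, using an invented four-outcome distribution: the integer code lengths ⌈−log2 P(x)⌉ satisfy the Kraft inequality, and an outcome that can be described in far fewer bits than −log2 P(x) is non-typical.

```python
from math import ceil, log2

# An invented computable distribution over four outcomes.
P = {"00": 0.5, "01": 0.25, "10": 0.125, "11": 0.125}

# Shannon's noiseless coding theorem: a typical outcome x is coded in about
# -log2 P(x) bits; an outcome describable in far fewer bits is non-typical.
code_lengths = {x: ceil(-log2(p)) for x, p in P.items()}
for x, length in code_lengths.items():
    print(f"{x}: -log2 P = {-log2(P[x]):.1f}, integer code length = {length}")

# The code lengths satisfy the Kraft inequality.
assert sum(2 ** -l for l in code_lengths.values()) <= 1
```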

The deficiency in randomness measures how much an ordered string can be compressed relative to a random string. As the following outlines, the deficiency in randomness provides a basis for a robust randomness test.

8.2 Martin-Lof test of randomness

Martin-Lof, who worked with Kolmogorov for a short period, developed a test for randomness now known as the Martin-Lof test [83]. However, before discussing the test in detail the idea will be explored. A string of length, say, 100, consisting of a mix of 50 ones and 50 zeros, would probably be considered random unless there was an obvious pattern such as the 50 ones forming the initial sequence. But what about a mix of 49 ones and 51 zeros, or 40 ones and 60 zeros? In other words, how much variation from the 50:50 ratio would correspond to a non-random string?



Martin-Lof recognised that, while each string in the set might be equally likely, it is possible to divide the set into regions based on some quantifiable feature of the strings. If this can be done, and one of these regions has fewer strings than the other, the probability of a string falling in the region with fewer strings will be lower than that of a typical string, which corresponds to a string falling into the larger region. Randomness is always relative to an implicit or explicit set of strings. For example, a string with a sequence of 100 1's, in the set of all strings with length 100, would intuitively seem less random than any string where there is a roughly even mix of ones and zeros. The trick is to define the feature that puts the string of 100 1's in a region which has few strings, and therefore there is a low probability of any string falling in this region. Such a region would contain strings that are deemed non-typical. While all strings are equally likely to be observed, the probability of observing a string that falls into a region with appropriate features or characteristics can differ markedly from that of one falling in the typical region. Even though there is no absolute cut-off between typical and non-typical, the boundary between the typical and non-typical regions can be set at a chosen significance level defined by a positive integer m.

The quantifiable feature that divides the strings into regions is the test function Test(x). This test function is designed so that the probability that a string falls into the non-typical region is given by ε = 2^−m and the probability for the typical region of the majority is 1 − 2^−m. A succession of typical/non-typical regions can be characterised by halving ε at each increment of m. As m increases, the typical region expands and fewer strings will be deemed non-typical or ordered, and more strings will be typical or random. At m = 0, i.e. ε = 1, no strings are typical. At m = 1 half the strings will fall in the typical region, as the value of m implies the cut-off is ε = 2^−1. At m = 2, 75% of strings will fall in the typical region and be deemed random at this level, while 25% will be deemed non-typical and hence will be seen to be more ordered. As can be seen in figure 8.1, the typical region of the majority at level m + 1 includes the typical region at level m, m − 1, etc. As m increases, more strings will be deemed random as the typical region grows. In other words, the requirement to categorise a string as non-typical is more restrictive as m increases.

Figure 8.1: Test(x) ≥ m defines string x as non-typical or non-random in the region m where the total probability is ≤ 2^−m.

The Martin-Lof test of randomness for a string effectively uses a test function to ask whether string x lies outside the majority or typical region at the significance level m, and therefore whether it can be rejected as being random at that level m. Such a test must be enumerable and lower semicomputable to be able to be implemented. The test does not show whether a string is random; rather it determines the basis on which a string can be rejected as being random. If there are 2^n strings, the key constraint on the test is that it must assign no more than 2^(n−m) strings to the non-typical region defined by a particular m. I.e. if m = 2, no more than 25% of the strings can fall in the non-typical region, otherwise the test would be inconsistent, as there are insufficient binary strings available. There are therefore two aspects to the test. The first is to ensure that, under the test, the fraction of strings belonging to the reject or non-typical region is no more than 2^−m, and the second is to have a test procedure to determine whether a particular string belongs to the reject region. This test procedure, defined by Test(x), determines whether Test(x) ≥ m for string x. If the cumulative probability of the strings with Test(x) ≥ m is no more than 2^−m, then x falls on the non-typical side of the typical/non-typical boundary at significance level m, and therefore x can be rejected as random. The mathematical formulation of this, when given the set of relevant strings characterised by a recursive (computable for all x) probability measure P(x), becomes [79]

Σ{P(x) : Test(x) ≥ m} ≤ 2^−m.

In words, the above test states that one identifies all the strings where the test has a value ≥ m. The string can be rejected as random, and deemed ordered at the level m, if the cumulated probability of all strings in the region is ≤ 2^−m. If this requirement is satisfied a string can be consistently assigned to the reject or non-typical region for the given value of m. If the region contained more than a fraction 2^−m of the strings, the test would not be sufficiently discriminating, as there cannot be that many non-random strings. This assignment is a Martin-Lof P-test of randomness. If there are 2^n strings in the set, an equivalent definition can be based on the number of strings (denoted by #(x)) that satisfy the criterion that Test(x) ≥ m. I.e. in this case the definition becomes:

#{x : Test(x) ≥ m} ≤ 2^(n−m).
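The counting requirement can be checked exhaustively for a toy test function. The sketch below uses the number of leading zeros as the quantifiable feature; this particular test function is an illustration chosen here rather than one taken from the text, but it satisfies the Martin-Lof constraint (with equality).

```python
from itertools import product

n = 12   # small enough to enumerate every binary string exhaustively

def test(x: str) -> int:
    """Toy test function: the number of leading zeros in x.
    A long initial run of zeros is a quantifiable 'non-typical' feature."""
    return len(x) - len(x.lstrip("0"))

strings = ["".join(bits) for bits in product("01", repeat=n)]

# The Martin-Lof counting requirement: #{x : Test(x) >= m} <= 2^(n-m)
for m in range(n + 1):
    count = sum(1 for x in strings if test(x) >= m)
    assert count <= 2 ** (n - m)
    print(f"m={m:2d}  strings in reject region: {count:5d}  bound: {2 ** (n - m):5d}")
```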

Martin-Lof test examples

The Martin-Lof process can test whether a sequence produced by tossing a possibly biased coin is random or not; more accurately, at what level the observed string would be deemed random. Let f(x) be the number of 1's in the string representing the outcome of n tosses of a coin that might be biased. One might expect f(x) to be about n/2 if the outcome is random, but how far from n/2 is reasonable for a random string? The procedure is to devise a test of the above form that involves f(x). For example, Cover, Gacs and Gray [37] use Chebyshev's inequality to develop a test function Test(x) = log2[2n^(−1/2)|f(x) − n/2|]. As there are at most 2^(n−2m) strings that satisfy Test(x) ≥ m, this satisfies the test requirement, since 2^(n−2m) ≤ 2^(n−m).

An alternative test is to use the entropy rate mentioned in the previous section 8.1. Again let the number of 1's in a string of length n be f(x) and define p = f(x)/n so that (1 − p) = [n − f(x)]/n. For large n, the Shannon entropy per symbol is termed the entropy rate; i.e. h(x) = −p log2 p − (1 − p) log2(1 − p). With appropriate coding, and for large n, a string of length n can be compressed to close to nh(x) bits, i.e. the string can in principle be shortened by an integer m, where n(1 − h(x)) ≥ m. For example, a string with 75% of 1's can be compressed to about 0.811n bits. Such a string of length 100 can be compressed by at least m = 18 characters. Let a Martin-Lof test be Test(x) = n(1 − h(x)). Test(x) effectively measures the amount the string x can be compressed on this basis, and Test(x) is a valid test, as the cumulated probability of all strings where Test(x) ≥ m is ≤ 2^−m.
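A quick numerical check of the figures quoted above (a sketch; the entropy-rate bound ignores lower-order coding overheads):

```python
from math import log2

def entropy_rate(p: float) -> float:
    """Shannon entropy per symbol of a binary source with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

n, p = 100, 0.75                       # a string of length 100 with 75% ones
h = entropy_rate(p)
print(f"h = {h:.3f}")                  # ~ 0.811
print(f"compressible to ~ {n * h:.1f} bits")
print(f"Test(x) = n(1 - h) = {n * (1 - h):.1f}")   # ~ 18.9, hence m = 18
```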

The entropy rate test can be demonstrated by considering a string of length 170. (This length is chosen as it is sufficiently long for the entropy rate to be meaningful, but short enough that 170!, which is needed to calculate cumulative probabilities, can still be evaluated on an Excel spreadsheet.) Consider the ratio 75:95 ones to zeros. At what level of ε would this string be rejected as random? In this case Test(x) = 1.701 > 1; the string can only be guaranteed to be compressible to a length of about 169. The cumulative probability of all strings in the region where Test(x) ≥ 1 is 0.2499, which is less than 2^−1. Thus the 75:95 string would only be rejected as random at the m = 1 cut-off level, i.e. the 50% level. It is still reasonably random. On the other hand, a 70:100 string has Test(x) = 3.8 > 3. Taking m = 3, the cumulative probability of a string being in the region where Test(x) ≥ 3 is 0.038, which is less than 2^−3 = 0.125. This is the region where ε = 0.125. The 70:100 string can be rejected as random, or considered ordered, at the level of m = 3 or ε = 0.125. On the other hand, the 70:100 string would not be considered ordered at m = 4.
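The spreadsheet calculation described above can be reproduced in a few lines; the cumulative probability of the region Test(x) ≥ m is summed over all counts of 1s whose test value reaches the level m (a sketch using exact binomial counts under the uniform distribution):

```python
from math import comb, log2

n = 170

def entropy_rate(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def test_value(k: int) -> float:
    """Test(x) = n(1 - h); it depends only on k, the number of 1s in x."""
    return n * (1 - entropy_rate(k / n))

print(f"75:95 string:  Test = {test_value(75):.2f}")   # ~ 1.70
print(f"70:100 string: Test = {test_value(70):.2f}")   # ~ 3.8

# Cumulative probability, under the uniform distribution, of the region Test(x) >= m
for m in (1, 3, 4):
    prob = sum(comb(n, k) for k in range(n + 1) if test_value(k) >= m) / 2 ** n
    print(f"m = {m}: P(region) = {prob:.4f}  (bound 2^-{m} = {2 ** -m:.4f})")
```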

Randomness deficiency as a Martin-Lof test

The randomness deficiency d(x|P) of section 8.1 can be made into a test. I.e. for a set of strings defined by an enumerable probability distribution P, the test² is

δ0(x|P ) = d(x|P )− 1.

If the uniform distribution over strings of length n is considered, where P(x) = 2^−n, the randomness deficiency test assigns strings of length n to a region using δ0(x|P) = n − C(x|n) − 1. As the number of strings with plain Kolmogorov complexity C(x|n) ≤ n − m − 1 is ≤ 2^(n−m), the number of strings with δ0(x|P) ≥ m must be ≤ 2^(n−m). Putting this in probability terms, by dividing by the number of strings, the result is:

²Readers will note that as C(x|n) < n − m, the '−1' is needed to ensure the relationship is ≤ rather than just <.


Σ{P(x) : δ0(x|P) ≥ m} ≤ 2^−m.

This clearly satisfies the criteria of the test but, more importantly, as the next section shows, this is a universal test for randomness.

The Martin-Lof universal P-test of randomness

It can be shown that there are universal tests of randomness that dominate all other tests of randomness. The argument, which is given in detail in Li and Vitanyi page 131 [79], shows that all effective tests of randomness can be listed by a systematic procedure. If δy(x) is the yth test, then the universal test δ0(x|P), with respect to a set with probability distribution P, can be chosen to be

δ0(x|P) = max_y {δy(x) − y}, where y is an integer. Hence

δ0(x|P ) ≥ δ(x|P )− c.

The universal test is as good as (allowing for the constant) any other conceivable, or even inconceivable, enumerable test. However, some understanding of "as good as" is called for in this context. If an ordinary Martin-Lof test implies that δ(x|P) ≥ m, so that the string x is in the reject region at the level m, the universal test has δ0(x|P) ≥ m − c. Thus when δ(x|P) rejects a string as random at m = 4, the test δ0(x|P) rejects the string as random, and considers it ordered, at the level m − c. It is more sensitive in rejecting randomness and assigning order. This is saying that, in general, δ(x|P) requires a higher m value than the universal test δ0(x|P) to recognise order. The universal test will pick out fewer random strings for a given m. However, there is no one unique universal P-test. Rather, there are a number of universal tests, each of which additively dominates any other test. In which case all return the same measure of randomness, allowing for the constant. While the universal test may use a different value of m to categorise a level of randomness, it still ranks the level of randomness for all strings in the same order.

The masterstroke in the argument is that the test based on randomness deficiency, i.e. the test δ0(x|P) = d(x|P) − 1, is a universal test. What is more, it is consistent with the definition of m-randomness outlined in section 8.1. I.e. Chaitin [25] has defined randomness as a string x for which H(x) is approximately equal to |x| + H(|x|), or H(x|n) is close to n. To be specific, a string of length n is Chaitin m-random if it cannot be compressed to a string with length less than n − m; i.e. H(x|n) > n − m. This also applies to an infinite string α, where the test is applied to finite prefixes of the infinite string, i.e. provided that there exists an m such that H(αn) > n − m for all n. It is seen (Schnorr [93]) that this definition is equivalent to the Martin-Lof definition of randomness at significance level m, allowing for the fact that the Chaitin definition uses prefix-free complexity rather than plain algorithmic complexity.


Evaluating δ0(x|P) and concluding that x is random using a universal test is equivalent to establishing that the string satisfies all computable randomness tests, including any not yet devised. However, unfortunately, δ0(x|P) is not computable.

The sum P -test

Another way of looking at randomness is attributed to L. A. Levin by Gacs in his 1988 lecture notes on descriptional complexity and randomness [58]. The approach, based on Levin's maximal uniform test of randomness [57], uses the level of payoff for a fair bet as a procedure to recognise a non-typical member of a set of outcomes (see the outline in Li and Vitanyi [79]). Highly non-random outcomes are a surprise and ought to generate a high payoff if a bet is placed on such an outcome occurring. For example, if a coin is tossed 100 times, the hypothesis is that all outcomes are equally likely. However, if the supposed fair coin happens to be biased 3:1 in favour of heads, in all likelihood there will be three times as many heads as tails. If each outcome is supposed to be equally likely, the occurrence of such a surprise outcome should be able to generate a return with a suitably cunning bet. The critical point is that surprise outcomes will be more algorithmically compressed than typical outcomes and therefore can be identified. In order to illustrate the idea, let H(x) be the algorithmic entropy for each outcome x. Make a $1 bet with the agreement that the payoff for the outcome of a (possibly biased) coin tossed 100 times is to be 2^(100−H(x)). If the outcome is random, the payoff would be small, as H(x) would be close to 100, the algorithmic description length of a typical string. But if H(x) is small for the actual outcome, the measure of surprise is the amount of payoff the outcome produces. This payoff in effect takes 2^−H(x) as the universal probability distribution m(x) (section 3.8).

If all outcomes are equally likely the probability for each is p(x) = 2^−100. The expected payoff then becomes

Σ_{|x|=100} 2^−100 · 2^(100−H(x)) ≤ 1.

As the expected return for the bet based on tossing a coin 100 times is no more than the stake, the bet is fair. This is because the Kraft inequality ensures Σ_{|x|=100} 2^−H(x) ≤ 1. However, if the coin is biased or cheating is taking place, the returns become significant. E.g. for the particular case of complete bias, where the outcome is 'hhh . . .', i.e. 100 heads, H(hhh . . . h) ∼ log2 100, about 7 bits. The return of 2^(100−H(x)) for a 1 dollar bet is then of the order of $10^28; a significant sum.

For the less unexpected case, where the coin is biased 3:1 in favour of heads, the string can still be compressed. The algorithmic entropy of a typical string is close to log2 Ωs, where Ωs is the number of members in the set of all strings exhibiting this ratio (see section 4). There are about 2 × 10^23 members in the set, giving H(x) ≈ 77. In the 3:1 case, the expected return is $2^(100−77) ≈ $8.4 million. This return is also not insignificant.
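The payoffs in the biased-coin illustration follow from the same counting. In this sketch, H(x) for a typical 3:1 string is approximated by log2 of the number of strings with that ratio; the O(log n) cost of specifying the ratio itself is ignored, so the exact dollar figures depend on rounding (the text's $8.4 million corresponds to taking H(x) = 77).

```python
from math import comb, log2

n = 100

# Completely biased outcome: 100 heads. Its algorithmic entropy is of the
# order of log2(n) bits, so the payoff 2^(n - H) is astronomical.
H_all_heads = log2(n)                                  # ~ 6.6 bits
print(f"100 heads: payoff ~ 2^{n - H_all_heads:.0f} ~ ${2 ** (n - H_all_heads):.1e}")

# 3:1 bias: about 75 heads. A typical such string is compressible to roughly
# log2 of the number of strings with that ratio.
H_3to1 = log2(comb(n, 75))                             # ~ 77.7 bits
print(f"typical 75:25 string: H ~ {H_3to1:.1f} bits")
print(f"payoff ~ ${2 ** (n - H_3to1):,.0f}")           # a few million dollars
```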

Page 127: ALGORITHMICINFORMATIONTHEORY ...algoinfo4u.com/AlgoInfo4u.pdf · Nevertheless the book pointed to Kolmogorov’s work on algorithmic complex-ity, probability theory and randomness

8.2. MARTIN-LOF TEST OF RANDOMNESS 127

While it is unlikely that the person providing the coin would accept a bet where the payoff depended on how algorithmically compressible the outcome was, the idea can be formalised to define a randomness test. This payoff test is called the sum P-test [57, 58]. Conceptually, the approach is to see whether an apparently random string is actually random or not by whether a judicious betting procedure will generate a positive return. If it is possible to define a fair betting regime that generates a positive return for certain outcomes, there must be hidden structure or pattern in the outcome string. On the other hand, for a truly random outcome, no fair betting regime will return an expected payoff higher than the stake. Not only does the payoff understanding of randomness intuitively make sense, within the sum P-test framework it becomes mathematically robust.

In order to generalise the approach, define δ(x) as a sum P-test, where P again is a recursive or computable probability. In the illustration above, δ(x) = 100 − H(x). This is Chaitin's degree of organisation (section 1.2) and is a variant of the deficiency in randomness. In general, for the sum P-test, δ(x) must satisfy the payoff requirement (mentioned above) that the expected return is no more than one. I.e.

Σ_x P(x) 2^δ(x) ≤ 1.

It can be shown that if δ(x) is a sum P-test it is also a Martin-Lof P-test (Li and Vitanyi page 257 [79]), but the sum P-test is slightly stronger, as the above equation implies δ(x) uses self-delimiting coding and therefore satisfies the Kraft-Chaitin inequality (3.2). As a consequence, an ordinary P-test, denoted by d(x), will differ from the sum P-test by a term less than 2 log2 d(x). This is due to the difference between simple coding and self-delimiting coding.

Similarly, a universal sum P-test exists for any recursive probability distribution P, not just the simple distribution in the illustration. Let κ0(x|P) = log2[m(x)/P(x)] = log2[2^−H(x)/P(x)], as the universal distribution m(x) is equivalent to 2^−H(x) (section 3.8). The expected return for a bet with payoff 2^κ0(x|P) is Σ_x P(x) 2^κ0(x|P). This is ≤ 1 as Σ_x 2^−H(x) ≤ 1. If the return for such a bet is significantly > 1 then the string x cannot be random. Just as log2 m(x) additively dominates all other semi measures, κ0(x|P) additively dominates all other sum P-tests. For example, the uniform distribution Ln over strings of length n has κ0(x|Ln) = n − H(x|n). (Note that this differs slightly from the P-test based on the randomness deficiency because (1) the sum P-test uses self-delimiting coding to provide the entropy measure H(x|n), and (2) the extra −1 in the deficiency in randomness is not included.) I.e. if κ0(x|Ln) is defined using a prefix-free UTM, it differs by no more than 2 log2 C(x|n) from the randomness deficiency test δ0(x|Ln) = n − C(x|n) − 1. This shows that the distance from equilibrium is a robust Martin-Lof measure of order, or the lack of randomness, of a specific configuration in the natural world. This has implications in the discussion as to whether the order in the natural world indicates design or not.

Gacs' sum P-test is intuitively appealing, as it implies that if one can win money by a judicious betting procedure on a set of apparently random outcomes, the outcomes cannot be random. In this case, the degree of order is related to the return on a fair bet.

The use of universal concepts

AIT makes use of universal concepts. These are measures that either additively or multiplicatively dominate other measures. In the case of additive dominance, a universal measure gives the same result as any other universal measure, but the zero level is shifted by a constant. For example, a universal computer can simulate any other computer, and the length of an algorithm on a universal computer additively dominates that on all other computers. Other than the setting of the zero point, all universal computers are as good as each other (although in practice the measures are more transparent if a simple UTM with a few core instructions is used). Again, the universal semi measure m(x) was defined to multiplicatively dominate all other semi measures – it is as good as any alternative. This is equivalent to the logarithm of the measure additively dominating the logarithm of all other semi measures. As −log2 m(x) additively dominates all measures of algorithmic complexity, this allows one to substitute H(x) for −log2 m(x) and vice versa.

Similarly, for randomness tests there is a test that additively dominates all other randomness tests – it is as good as the best. Different universal randomness tests generate the same order when strings are listed according to their level of randomness. However, the randomness labels may be shifted by a constant between different universal tests. As a consequence, the simple deficiency in randomness provides a randomness measure that is consistent with all other universal randomness tests. In the same way, the self-delimiting equivalent of the deficiency in randomness is log2[m(x)/P(x)], which is equivalent to −log2 P(x) − H(x). This can also be taken to be a universal randomness test. These show that high probability outcomes have low randomness deficiency, whereas low probability outcomes are non-typical.

As has been mentioned, physicists will recognise that the randomness arguments apply to an equilibrium configuration in statistical thermodynamics, and a typical string from an equilibrium configuration would be expected to pass all randomness tests. However, there are always non-typical configurations in the set of possible equilibrium states. For example, all the atoms in a Boltzmann gas could by chance find themselves on one side of the container. As this would be extremely unlikely to occur, it would be deemed to be an ordered fluctuation from equilibrium. While such a fluctuation could be considered as a state within the equilibrium set of states, it has a negligible contribution to the thermodynamic entropy. Because an ordered configuration is non-typical, once such a configuration has been identified an entropy gradient exists between parts of the system. Such an entropy gradient can be harnessed to extract work from the system.


Chapter 9

How replication processes maintain a system far from equilibrium

9.1 Introduction to maintaining a system distant from equilibrium

Brief summary of earlier exposition of AIT

Section 1.2 points out that the length of the shortest, appropriately coded algorithm that generates a particular string on a UTM defines, in bits, the algorithmic entropy H. The algorithmic entropy can provide insights into the natural world once it is recognised, as outlined in section 5.2, that the physical laws driving a system from one state to another are in effect computations on a real world Universal Turing Machine (UTM). As algorithmic entropy differences are independent of the UTM used, the algorithmic entropy derived from a programme that maps a real world computation by manipulating bits in a reference UTM in the laboratory is the same as the equivalent number of bits derived from the real world UTM. The flow of bits and information through the two systems is the same.

The algorithmic entropy is the entropy of a particular microstate, whereas the traditional entropies return a value for a macrostate corresponding to a set of microstates. Nevertheless, the provisional algorithmic entropy outlined in section 4 is the algorithmic entropy measure for a typical microstate in an equilibrium macrostate and returns the same value, allowing for units, as the conventional entropies for the macrostate. Where a macrostate is a homeostatic state distant from equilibrium, the provisional entropy corresponds to the thermodynamic entropy of the macrostate if it is envisaged as a local equilibrium. The provisional entropy is the length of the shortest programme that specifies a particular configuration. Because of this relationship, one can easily move between the algorithmic measure and the conventional entropy description. This allows the provisional entropy to be used to track entropy changes and flows as a system evolves over time. In what follows, the word "equilibrium" strictly refers to a local equilibrium which is stable for a long period of time, although, given sufficient time, it will degrade further to a global equilibrium. For example, if light from the sun is blocked from reaching a forest, given sufficient time, the forest and its ecology will be reduced to the local equilibrium of mainly carbon dioxide, water, and minerals.

The algorithmic approach provides a measure of the distance a natural system is from its (local) equilibrium in terms of the number of information bits that must be added to the system to shift it from an ordered to an equilibrium state. This measure of distance will be denoted by Dord. This measure is closely related to Chaitin's degree of organisation Dorg discussed in section 6.2. If there are Ω states in the equilibrium configuration, the algorithmic entropy of a typical string will be log2 Ω. Denoting the algorithmic entropy of the ordered string s by H(s), the distance from equilibrium in bits is given by Dord = log2 Ω − H(s).

Tracking the flow of computational bits into or out of a system to change one state to another, or to produce an equilibrium state, can be related to the entropy flows by Landauer's principle [74] discussed in section 6. As was discussed, this principle states that where one bit of information in a computational machine, operating at a temperature T, is erased, at least kBT ln2 joules must be dissipated. In a conventional computer this dissipation appears as heat passing to the non-computational parts of the system. Because the computation that shifts one real world state to another can be simulated on a laboratory UTM, the change in information is independent of the UTM. As a consequence, where the length of the algorithmic description of the state of a system changes when ∆H bits are removed, the corresponding thermodynamic entropy change in the real world system is kB ln2 ∆H. Similarly, ∆H bits must be returned to the system to restore it to the initial state.
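To make the bookkeeping concrete, the sketch below converts a distance from equilibrium Dord into the minimum Landauer dissipation; only the Boltzmann constant and the kBT ln2 relation are taken as given, while the microstate counts are invented for illustration.

```python
from math import log, log2

K_B = 1.380649e-23          # Boltzmann constant, J/K

def landauer_heat(bits: float, temperature_kelvin: float) -> float:
    """Minimum heat dissipated (joules) when `bits` bits are erased,
    by Landauer's principle: k_B * T * ln(2) per bit."""
    return bits * K_B * temperature_kelvin * log(2)

# Invented illustration: an equilibrium macrostate with 2**100 microstates,
# and an ordered microstate whose algorithmic entropy H(s) is 30 bits.
omega = 2 ** 100
H_s = 30
D_ord = log2(omega) - H_s          # distance from equilibrium, in bits
print(f"D_ord = {D_ord:.0f} bits")
print(f"Minimum dissipation at 300 K: {landauer_heat(D_ord, 300.0):.2e} J")
```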

The above points provide a basis for understanding how a natural system can maintain itself distant from equilibrium by natural processes. Not only can the provisional algorithmic entropy quantify the order embodied in a system to capture its information content, it also provides the thermodynamic constraints that govern the emergence of local order. The following sections explore the processes of degradation and the requirements to regenerate the initial states in three natural systems: the degradation by heat diffusion in a gas, the dissipation of energy in a photodiode, and the ignition of a hydrogen-oxygen system to form a chemical equilibrium with water (or steam). But first the degradation of a natural system is explored.

Degradation of natural systems

As a living system operates far from a local equilibrium, it faces real time degradation that drives the system to a local or global equilibrium under the relentless operation of the second law of thermodynamics (see section 6.3). In thermodynamic terms, the second law ensures that the entropy in such a system increases as the concentrated energy embodied in ordered structures becomes more dispersed and becomes unavailable to do work.¹ A living system must have in-built mechanisms to maintain itself in a far-from-equilibrium configuration by offsetting the degradation implied by the second law of thermodynamics. The mechanisms involve the ejection of disorder and the injection of order (or alternatively high grade energy) into the system to create ordered structures. It is when these natural computation processes operating under physical laws eject disorder that more ordered or low entropy structures are left behind. One can perhaps see this as a regulation process where the cost of maintenance is the cost of restoring the system to an initial macrostate.

¹Fortunately for life, the rate at which second law processes drive the system towards equilibrium is constrained by rates involved in natural processes and barriers such as the chemical activation energy.

The entropy changes of both living and non-living natural systems are constrained by the same laws. The later section 9.4 provides the general criteria for any system to be maintained distant from equilibrium, while section 9.3 below deals with some conceptual issues around the reversibility of natural laws. In order to take this argument further, the later section 9.5 shows that, in contrast to the gas, the photodiode and the hydrogen-oxygen system, replication processes can naturally create and maintain far-from-equilibrium structures in a homeostatic configuration. This allows one to apply the entropy and energy requirements of the process of replication to very simple systems and carry the initial insights over to biologically more complex living systems.

9.2 Examples of the computational issues around the degradation of simple natural systems

There are some conceptual issues that need to be resolved to avoid confusion over what order is, and where order comes from. The following three simple representative thought experiments help to clarify the computational issues and the difficulty in maintaining a system off-equilibrium under the influence of the second law of thermodynamics. The three processes, described from an algorithmic point of view, are analogous to the second law expansion outlined in section 6.3. The first and simplest example is where heat is removed from the system, making the system more ordered. In a sense order is imported from the external environment. The second looks at maintaining a photodiode in an excited state, while the third looks at maintaining a hydrogen-oxygen mixture away from equilibrium. The last two examples describe systems which start in a configuration where high grade energy is stored to be redistributed throughout the system over time.

• Consider a gas in a container existing in an instantaneous configuration distant from equilibrium where a region of the gas is at a higher temperature than the bulk of the gas. The difference in temperature between the two regions creates an energy gradient and the second law will drive the system towards a uniform temperature. From an overall system perspective, while the specification of the initial configuration shows it is ordered relative to the final configuration, as nothing has left the system the thermodynamic entropy is unchanged. One can envisage the bit settings in the momentum degrees of freedom of the warmer part of the gas as the programme settings that enact the natural laws transferring energy to other gas particles by particle collisions. This process diffuses energy uniformly through the system, driving it towards the most probable set of configurations; i.e. the equilibrium set of states. The entropy of the overall system does not change until excess heat is removed and passed to the environment. If a region of the system is to be permanently hotter than the bulk, heat would need to be removed once the energy disperses, and compensating heat would need to be introduced, perhaps by radiation, to maintain the high temperature region of the gas. I.e. the entropy gradient between the system and the environment must be maintained by introducing heat to one part of the system and extracting heat from another. In this example the order is not introduced by the injection of energy, but is introduced by the removal of heat, or alternatively the injection of order as cold from the environment.

• The second example of degradation is a photodiode where, initially, the electrons have been excited from the valence band to the conduction band. The macrostate of the system is defined by the set of microstates consisting of the excited electronic states, the vibrational states and other states of the system. The excitation produces a potential difference between the conduction and valence bands that can, in principle, be used to heat the system either directly (through collision processes as the electrons return to the valence band), or indirectly by heating a resistor connected to the photodiode. The photodiode system can also do work if the potential difference generated by the excitation is used to drive an electric motor. Whether by doing mechanical work that dissipates heat, or producing heat directly without work, if left to itself the system will undergo a second law degradation process and settle in a local equilibrium where most electrons return to the valence band, increasing the temperature. Stored energy, as programme bits in the conduction band, has diffused through the system to increase the bits in the momentum degrees of freedom and hence the temperature.

In principle, the photodiode can be maintained off-equilibrium in a flux of photons that continuously repopulates the conduction band, provided that the excess heat generated by the process is removed by passing heat to the environment.

• The third representative case is where a stoichiometric mixture of hydrogen and oxygen burns to form either liquid water or steam, depending on the temperature and pressure. After ignition, the system moves from the initial configuration and will settle in chemical equilibrium where

O2 + 2H2 ⇌ 2H2O + heat.

If the system is isolated, most of the hydrogen and oxygen will have been converted to H2O, and the temperature of the local equilibrium will increase. What has happened is that stored energy has been released to form water and heat. However, this process is reversible and in itself does not generate order. I.e. given sufficient time the system would recycle to the initial state. It is only when heat is lost that the remaining system becomes more ordered. Once heat is lost, and if the pressure is such that sufficient liquid water forms, the system can be restored to the original configuration by losing heat and using external energy to regenerate hydrogen and oxygen by, for example, electrolysing the water. This is unlikely to happen naturally.

These examples highlight an important point: the entropy of an isolated system does not decrease, even if a chemical reaction takes place. In the first of the above bullet points, raw energy was originally introduced as heat to create the entropy differential between the system and the environment. While the entropy differential between parts of the system can be harnessed to do work, ordering only occurs (i.e. the entropy is lowered) when disorder as waste or heat is ejected. Similarly, when the components in the hydrogen-oxygen system ignite to become water + heat, the entropy is the same. What has happened is that programme bits embodied in the stored energy have computed a new state where bits are distributed through the momentum states. The critical factor is that diffusing energy through the system increases the temperature, creating a situation where excess entropy can pass to the environment. Ordering only occurs when this disorder is ejected, often as waste species or as heat, leaving a more ordered structure behind. In a sense one can envisage this as the cooler, and therefore more ordered, environment injecting order into the system.

Maintaining a system off-equilibrium corresponds to maintaining an entropy gradient between the system and the equilibrium set of states. The physical laws that determine the trajectory of a real world system through state space can be envisaged as the result of a computational process operating under algorithms embodied in these laws. As the thermodynamic cost of replacing the lost information becomes the cost of maintaining the system off-equilibrium, AIT provides a tool to inquire into the emergence and maintenance of order in these real world systems.

System processes and system degradation

However, most systems, like the inert systems described, can only maintain themselves in the initial set of configurations under special circumstances. They cannot self-regulate, as to do so requires an autonomous process to replace the lost information. The photodiode system can only be sustained in a stable set of configurations by accident, as for example happens when the system exists in a fortuitous flux of photons that excite electrons into the conduction band, and the system can eject sufficient excess heat to maintain the energy balance. Similarly, the hydrogen-oxygen system would require an unlikely set of circumstances to be sustained off equilibrium. An inert earth (without life) will reach a steady state where the surface temperature is such that the radiation from the sun exactly balances the heat radiation from the earth. The inert earth as a system is broadly analogous to the photodiode system. The steady state situation that emerges when energy inputs and outputs balance is not true regulation.

Contrast this with a living cell. A living cell can actively maintain itself far-from-equilibrium by accessing nutrients and ejecting disorder as waste. As has been mentioned, Landauer's principle [74] shows that maintaining a non-equilibrium system requires the lost information in bits to be replaced, and excess bits that have entered the system to be ejected. Much of the self-organising capability of life on earth, or of systems of cells, would seem to be a consequence of evolutionary processes that lead to the capability of structures to replicate. The argument developed below shows that the ability of the earth to self-organise is a consequence of the replication capability of the living structures on earth. While inert systems can maintain stable configurations, replication processes provide a natural mechanism for true regulation.

9.3 Entropy balances in a reversible system

Equation 3.20 shows that in general the algorithmic entropy of the output o of a computational process may be less than the sum of the algorithmic entropy of the initial state i and the conditional entropy H(o|i*). This happens when information is discarded in going from i to o. In which case, o is said to be redundant with respect to i (see Zurek [114]). Of more interest in this section is the situation where a reversible isolated system moves from an apparently ordered state to a more disordered state, as outlined in Section 6.3. However, there seems to be a paradox, as the entropy cannot increase in an isolated system. Section 6.3 shows that the apparent entropy increase in the output state is introduced into the system by the initial bit settings in the programme state with its halt setting. Strictly, the true initial state is (ph, s0). This includes the observed initial state s0 and the stored programme state ph. Furthermore, if both sets of bits are tracked there is no entropy change for an isolated system. In order to avoid ambiguity, when the provisional entropy of a configuration identifies order in a macrostate distant from equilibrium, one can say the configuration of a microstate in the macrostate is more ordered than a typical or equilibrium state. But, if programme bits are considered, what might be termed an ordered fluctuation is just an improbable state or set of states in the full equilibrium set.

Once the bit settings of the programme ph that drive the system to the halt state are considered as part of the initial configuration, it is seen that H(ph, s0) = H(sh, ζ), where H(sh, ζ) is the entropy of the halt configuration including the history of the computation ζ that is needed to maintain reversibility. This understanding is implicit in the discussion by Zurek [113] and Bennett [9, 11]. Here the notation ζ is used for the history information, rather than the measure g in Section 6.3, as here the history is not considered as garbage, but as being critical to the way the system evolves. If, however, all the information is accounted for in a reversible process, an equals sign is appropriate in Equation 3.20 as there can be no entropy change. The equivalent of Equation 6.4 for the reversible situation, where all entropy changes are tracked, then becomes:

H(sh, ζ) = H(s0) + H(sh, ζ|s0*). (9.1)

This point is used to describe the entropy requirements of homeostasis in the next section, 9.4.

For example, as is discussed in more detail earlier, the configurational states representing a stoichiometric mix of hydrogen and oxygen are ordered, as the energy is concentrated in the initial states of the system. There is no entropy change to the total system and, in principle, the system could revert back to the original state. Reversibility is only lost when information corresponding to the history ζ is lost, as happens when excess heat is ejected from the system. In which case, the system's entropy decreases as the disorder in the momentum states is reduced.

9.4 Homeostasis and second law evolution

Previously, Zurek [114] has shown that where a system changes from the initial state i to the output state o by an external disturbance, corresponding to a change of H(i) − H(o) bits, the loss or gain in bits needs to be reversed by a regulation process acting from outside the system. The approach requires an external controller to inject bits to maintain the system. In contrast, this section looks at the requirements for autonomous regulation to counter second law effects when a system is distant from equilibrium but is being driven towards equilibrium. I.e. the requirements for the regulation processes to operate from within the system itself, rather than be imposed externally. Autonomous regulation is essential if a natural living system is to be maintained in a viable regime far-from-equilibrium. The argument here is that autonomous regulation is not a fortuitous outcome of complex interactions of natural laws but, at least in simple cases, and perhaps in many cases, is accomplished by replication processes ([48]).

Before looking at replication processes as such, the thermodynamic information requirements for a system to maintain homeostasis are outlined.

Information requirements of homeostasis

As the provisional algorithmic entropy defines the common entropy of each microstate in a set of microstates with the same characteristics, the formalism can be used to determine the algorithmic processes that maintain a real world system in a homeostatic state distant from equilibrium. Such a system is maintained in a stable region of the state space by the inflow of information and the outflow of waste, while keeping the number of bits in the system constant. Two issues arise:

• How can such a homeostatic macrostate be sustained by a natural process when the second law of thermodynamics is driving the system to equilibrium – the set of most probable microstates?

• What is the algorithmic entropy of a microstate in a homeostatic system distant from equilibrium?

The arguments in these sections follow Devine [48]. Homeostasis implies that the system's provisional entropy must not change, and therefore the system must be maintained by a regulation process in an attractor that provides a viable macrostate. The provisional entropy, allowing for units, represents the thermodynamic entropy of the macrostate as a local equilibrium. Changes in the provisional entropy can be used to track the thermodynamic entropy flows in and out of the system. In order to see how a stable or homeostatic set of far-from-equilibrium configurations can be sustained under the influence of the second law of thermodynamics, we first define the viable region of the system. This will be taken to be the macrostate E consisting of a set of far-from-equilibrium microstates where the ith microstate is denoted by ǫi. Each of these microstates specifies a configuration in the system's state space and all have the same thermodynamic entropy (given the particular resolution of the state space grid). All the microstates in the macrostate form a set of far-from-equilibrium configurations.

Section 4.1 showed that the provisional entropy of a microstate consists of the equivalent of the Shannon entropy of the set of all states, and a term that defines the structure common to all the microstates in the set. Furthermore, allowing for units, the thermodynamic entropy of the macrostate as a local equilibrium returns the same value as the provisional algorithmic entropy of a typical microstate ǫi (i.e. the overwhelming majority of states) in the macrostate. As a consequence, changes in the provisional entropy can be used to track the entropy flows in and out of the system. In its viable region, the system moves through an "attractor-like" region of the state space where each microstate ǫi has the same provisional entropy.² This attractor-like region maintains its shape as it drifts through a hypercube where the labels on the variables, but not the type of variables, are changing. I.e. over time, one carbon atom will be replaced by another.

²I don't think this is a true attractor as the labels on the variables change. I.e. over time atom i is replaced with another identical atom j.

However, if a system is originally in a microstate ǫi belonging to the far-from-equilibrium macrostate, second law processes, acting within the system, shift the computational trajectory away from the viable macrostate E. The system can only be maintained in the viable region if the second law degradation processes can be countered by accessing the stored energy embodied in appropriate species (such as photons, atoms, molecules or chemicals), where the states are represented by the string σ, and, at the same time, ejecting disorder or waste represented by the string ζ. Alternatively, work can create an entropy gradient, as happens for example when a heat pump produces a low entropy sink to absorb excess heat.³

³As any machine requires an entropy gradient to do work, work is just another way of transferring order from one system to another via this entropy gradient.

In a homeostatic set of states, the system moves from one state in the viable region to another as entropy and replacement species enter the system, while heat and degraded waste leave. A snapshot of a particular state in the homeostatic stable regime will show a mix of structures; some highly ordered, some that are partially disordered, and some that can degrade no further. For the algorithmic entropy of a typical state in the homeostatic set of configurations to remain constant, the history of the computation that increases at each step must continuously be ejected as waste, otherwise bits would build up in the system.

The restoration process that counters the second law

The second law process that degrades the species, and the restoration computation that re-creates order, are best seen as two separate processes. Consider the first stage of the process as the second law disturbance that drives the system away from the viable region towards a non-viable configuration ηj that, to maintain reversibility, also includes the states capturing the history of the computation implied by the disturbance. This disturbance degrades the system's structures. From a computational point of view this disturbance can be represented by the algorithm p∗2ndlaw, the shortest algorithm that implements second law processes to shift the system to ηj. In which case, the natural processes are equivalent to the following reversible computation on the reference UTM [48].

U(p∗2ndlaw, ǫi) = ηj . (9.2)

As the maintenance process never halts, the algorithmic entropy is determined by bit flows, and an instruction to link routines that would be needed in the static case is not needed here, as discussed in section 5.2. As nothing leaves the system at this stage, the algorithmic entropy H(ηj) of the system following the second law process is given by,

H(ηj) = Hprov(ǫi) + |p∗2ndlaw|. (9.3)

As, in principle, the degradation is reversible, bits are conserved and, provided all the history bits are taken into account in ηj, the equals sign is appropriate in Equation 9.3. I.e. the change of H(ηj) − Hprov(ǫi) bits comes from the bits in p∗2ndlaw. Without the ability to eject disorder, and without access to concentrated energy or low entropy sources, the system would degrade

3 As any machine requires an entropy gradient to do work, work is just another way of transferring order from one system to another via this entropy gradient.


and over time reach a local or, with more time, a global equilibrium under the second law of thermodynamics. What might be termed a regulation or restoration process counters the effect of the second law disturbance by the restoration algorithm p∗rest that acts on an input resource string σj specifying the species containing the stored energy and the string ηj specifying the non-viable configuration. The programme p∗rest also contains the programme components embodied in the physical laws carried with the input string σ. This process corresponds to a computation on the input string σj, that describes the species containing the stored energy, together with the string ηj that specifies the non-viable configuration.

The net output for the restoration process is the final microstate ǫk with the same provisional entropy as ǫi and with an extra string ζl that is generated with ǫk. In general a computation like prest generates ǫk and ζl together and therefore there is a high level of mutual information between the two components of the outcome. From Equation 9.2, this restoration process is represented by the computation;

U(p∗rest, ηj , σj) = ǫk, ζl. (9.4)

From Equation 9.4,

Hprov(ǫk, ζl) = H(σj) + H(ηj) + |p∗rest|. (9.5)

Equations 9.3 and 9.5 can be added, and after rearranging it is seen that the provisional entropy of the system has increased by;

H(waste) = Hprov(ǫk, ζl) − Hprov(ǫi) = |p∗2ndlaw| + |p∗rest| + H(σj). (9.6)

This increase, denoted by H(waste), is the number of bits that need to be ejected for the system to be restored to the original macrostate. That is, it is the requirement for a system to self regulate. As one would expect, the entropy of the waste string must include the bits embodied in the second law disturbance algorithm, the bits in the regulation algorithm, and the bits that have entered the system to define the microstate of the species carrying the concentrated source of energy. All this additional entropy must be ejected if the system is to be returned to the initial macrostate.

This can be related Zurek’s approach (discussed in section 6.2), which ar-gues that, as H(ηj) − H(ǫi) = |p∗2ndlaw|) bits shift the system away from thestable configuration, the same number of bits needs to be returned to the sys-tem if it is to return to the original macrostate. In contrast to the Zurekapproach in the present situation of autonomous regulation, the bits in thecomputations p∗rest and σj involved in the regulation or restoration processare part of the system and need to be accounted for. Furthermore, Equation9.6 implies that, for autonomous regulation, the net decrease in bits due tothe regulation process must equal the increase in bits that enter through thedisturbance if bits are not to build up in the system. I.e.

H(waste) − |p∗rest| − H(σj) = |p∗2ndlaw|.
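The bit bookkeeping in Equations 9.3 to 9.6 can be made concrete with a short numerical sketch. The Python fragment below is a minimal illustration only; the bit counts assigned to |p∗2ndlaw|, |p∗rest| and H(σj) are arbitrary assumed values, not figures from the text. It simply verifies that the waste that must be ejected balances the bits injected by the disturbance, the restoration programme and the resource string.

# Minimal sketch of the bit balance in Equations 9.3-9.6.
# All numerical values are illustrative assumptions.
H_prov_initial = 1000   # Hprov(epsilon_i): provisional entropy of the viable microstate (bits)
p_2ndlaw = 40           # |p*_2ndlaw|: bits of the second law disturbance algorithm
p_rest = 25             # |p*_rest|: bits of the restoration algorithm
H_sigma = 60            # H(sigma_j): bits specifying the input resource string

# Equation 9.3: nothing has yet left the system after the disturbance
H_eta = H_prov_initial + p_2ndlaw

# Equation 9.5: the restoration computation adds the resource string and its own bits
H_restored_plus_waste = H_eta + H_sigma + p_rest

# Equation 9.6: the bits that must be ejected to return to the original macrostate
H_waste = H_restored_plus_waste - H_prov_initial
assert H_waste == p_2ndlaw + p_rest + H_sigma

# The identity below Equation 9.6: the net ejection matches the disturbance bits
assert H_waste - p_rest - H_sigma == p_2ndlaw
print("bits to eject per regulation cycle:", H_waste)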


However, the bits that are ejected from the system are more degraded in the sense that much of what were originally programme bits have become bits associated with the momentum states and correspond to an increase in temperature. A simple example is where an external low temperature sink extracts heat from the system. The low temperature sink is more highly ordered than the system. The absorption of heat or waste from the system by the action of physical laws is in effect a transfer of order into the system from the environment. The above argument also implies that the algorithmic entropy specifying a microstate in a homeostatic system is related to the entropy in, less the entropy out. Section 5.2 specifically addresses this issue using an argument closely related to that in this section.

Equation 9.6 shows that the ability of the system to maintain homeostasis does not directly depend on Hprov(ǫk), and therefore on the system's distance from equilibrium. Rather, as is discussed by Devine [48] and implied by the requirements of equations 9.19, the rate at which the second law degradation processes can be countered by the restoration process depends on:

1. the rate at which the concentrated source of energy can enter the system, and

2. the rate at which excess entropy can be ejected from the system.

These two requirements determine the entropy balance for the far-from-equilibrium set of configurations available for the system. The greater the rate at which the system degrades, the further from equilibrium the system is likely to be in order to eject sufficient waste such as heat. Also, the further the system is from equilibrium, the greater its capability to do work, as this depends on the entropy difference between the system and its surroundings.

As the second law programme embodied in the natural laws shifts the trajectory out of the homeostatic set of configurations, increasing the entropy by |p∗2ndlaw|, any regulatory or restoration programme embodied in natural laws, such as a replication process, must turn the excess entropy into waste that can now be discarded in a manner similar to that described in section 6.2. As Equation 9.6 implies, bits move in and out of the system during the regulation process. Many natural systems, such as the photodiode mentioned above, cannot self-regulate unless there are fortuitous circumstances or outside interventions. The hydrogen-oxygen-water mix mentioned earlier can be maintained if hydrogen and oxygen resources are available to feed into the system, while excess water and heat are ejected. But as heat input is needed to trigger the combustion process, this is not replication. In general however, in order to regulate, the system must be able to sense or predict the deviation from the homeostatic configuration and implement a correction process [3]. In order for a natural system to truly regulate, excluding human or animal cognitive responses, natural regulation processes must be wired into the system in some way. In section 9.7 below it is argued that replication processes behave like a primitive form of regulation that does not require intelligent external interventions, an observation that provides fundamental insights into living systems.


True replication processes are able to naturally restore a system to a homeostatic state as, given adequate resources, the drive to create or maintain order is inevitable. Because replication drives a system towards a more ordered configuration, when degradation occurs natural replication processes seek to access external resources to create or maintain the order. As is discussed in more detail below, this is why the production of coherent photons in a laser, the alignment of spins, or the replication of cells in a broth of bacteria, continue far from equilibrium. Such replicating systems maximise the number of replicated units in a given environment. In so doing they counter the second law degradation provided resource inputs can be accessed and waste can be ejected, as shown in Equation 9.6. These replicating systems settle in an attractor-like region of the state space with the same provisional entropy. As resources flow through the system, replicas are created and destroyed. Outside the viable attractor region, the trajectory is highly sensitive to its initial state and the system behaviour is chaotic. However, all the states in the attractor region have virtually the same provisional entropy. The capability of replicating systems to regulate suggests that it should be no surprise that replication processes underpin the maintenance of living systems far from equilibrium.

9.5 Replication processes to counter the second law of thermodynamics by generating order

The argument developed below is that replication processes, which generate algorithmically simple and therefore low entropy structures, are a major mechanism in generating and maintaining a living system distant from equilibrium. As replication processes are shown to be core to many natural systems, understanding replication is important to understanding these systems. The following points are explored throughout this section.

• The critical point is that the algorithmic entropy of a system of replicated structures is low. Relatively few bits are required to specify such a system as the algorithm only needs to specify the structure once and then copy, or generate, repeats of the structure by a short routine.

• Replication processes can create and maintain an ordered system away from equilibrium in an attractor-like region of the system's dynamical state space where all the states have the same provisional algorithmic entropy.

• Variation in a system of replicated structures leads to an increase in the algorithmic entropy. However, variation also provides a natural mechanism to stabilise the system against change by allowing the system to evolve to a more restricted region of its state space. In other words, variation can maintain the system in a stable configuration through adaptive evolutionary-like processes. AIT provides a reasonably convincing argument that, in many situations, those variants that persist in a changing environment are those that use resources more efficiently and have lower entropy throughputs.

• Coupled and nested replicator systems create greater stability against change by co-evolving as, from an entropy perspective, the replicated structures use resources more efficiently. This efficiency, called here 'entropic efficiency', would seem to be an important constraint on evolutionary selection processes. Nevertheless, while each replicated structure is efficient in this sense, the number of interdependent or nested replicated structures increases to ensure that overall, the high quality energy degrades more effectively than would otherwise be the case [92].

Replication as a process that creates ordered structures

As has already been argued, Algorithmic Information Theory indicates that replication processes are key to generating much of the observed local order around us. In contrast to alternative structures where each structure must be specified independently, a system of replicated structures is more ordered as the algorithm that generates the system is more compressed. As fewer bits of information are required to describe the structure and, as replication occurs through natural laws, provided resources are available and high entropy waste can be ejected, replication processes can both create and maintain order. In such a situation, replicated structures are more likely to emerge than similar structures produced by other natural processes. Indeed, it seems that many natural systems rely on replication processes as a means of self-organisation and self-regulation. Given one replicator, others will form and a system of replicated structures can be maintained in a way that compensates for the second law of thermodynamics.

This can be seen by considering a system of N repeated replicated structures, that can be generated by a short algorithm of the form "Repeat replicated structure N times". Alternatively, in the case of a living structure such as a cell, the algorithm is of the form "Replicate structure N times" where the instruction is embodied in the DNA in the cell. The algorithmic entropy of this algorithm corresponds to;

Halgo(system) = |N| + |replicator description| + |replicate routine|. (9.7)

Relative to the large number of replicas that eventuate, and given common natural laws, the replicator description and replicator routines are short. The algorithmic entropy is dominated by the description of N, which only requires something like log2N bits. The system of replicated structures is ordered compared to a similar set of non-replicated structures where the algorithm needs to specify each of the N structures separately, taking at least N bits rather than log2N bits4.
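As a rough numerical illustration of Equation 9.7, the Python sketch below compares the bits needed to specify N copies of a replicated structure with the bits needed when each of the N structures must be described independently. The values of d (bits per structure) and c (bits for the repeat routine) are assumed for illustration only.

import math

d = 200          # assumed bits to describe one structure on its own
c = 50           # assumed bits for the "repeat N times" routine
N = 10**6        # number of replicated structures

# Equation 9.7: one structure description, the routine, and roughly log2 N bits to specify N
H_replicated = math.ceil(math.log2(N)) + d + c

# A set of N unrelated structures needs each one specified separately
H_independent = N * d

print("replicated system : ~%d bits" % H_replicated)
print("independent system: ~%d bits" % H_independent)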

The process of replication is like a chain reaction that creates repeats of a particular structure using structural or stored energy. While replicated structures accumulate in the system, other physical processes, such as collisions or chemical reactions driven by the second law of thermodynamics, will destroy these structures. However, replication processes that access high quality energy and eject disorder are able to restore such a system to the original ordered set of configurations. The increase in order due to replication only occurs when the disorder appearing in the states that capture the history of the computational process is ejected, allowing the replicating structures to dominate the system. Homeostasis requires that there is no overall entropy change. In general, replication involves physical or biological structures that reproduce by utilizing available energy and resources and, at the same time, ejecting excess entropy as heat and waste to the environment.

Physical examples of such replicating systems are a crystal that forms from a melt, a set of spins that align in a magnetic material, and coherent photons that emerge through stimulated emission. Biological replication, including an autocatalytic set, bacteria that grow in an environment of nutrients, and the growth of a biological structure such as a plant, involves more complex pathways. In the last example, as different genes are expressed in different parts of the plant, variations of the basic structure emerge at these points.

The characteristic of a replicator is that, in a resource rich environment, the probability that a structure will replicate increases with the number N of existing structures. Consider a classical physical system, initially isolated from the rest of the universe, where the instantaneous microstate of the system is represented by a point in the multidimensional state space. Over time the state point of the system will move through state space under the operation of physical laws. As the probability of replicated structures appearing increases with their occurrence, replication will drive the state point to a region of state space dominated by a large number of repeated replicated structures. In this case replicated structures are more likely to be observed than alternatives.

Consider the situation where the probability of the appearance of a replicated structure is proportional to N^x (where x > 0); then dN/dt ∝ N^x [101]. For this to happen it is assumed that entropy as waste products or as heat can be ejected so that entropy and waste do not build up within the system. For example, given a double helix of DNA in the right environment, the probability of a second double helix appearing is comparatively high.

The focus here is on true replication processes where the presence of the replicated structure triggers more replicas5.

4 Note that here the word "replica" will be used to denote those physical or biological structures that use external energy and resources to reproduce themselves. This means the biological cell is the replicating unit, not the DNA information string that is copied as part of the replicating process.

5 True replication requires that the presence of one unit creates a chain reaction to create more units. This contrasts with processes such as the combustion of oxygen and hydrogen, where the heat of combustion, not the presence of water, triggers the process.


Where resources are limited, the number of replicated structures grows over time until a state of homeostasis is reached; a state where the set of replicators and the environment reach a long-term stable relationship.

A schematic representation of this process is the logistic growth equation

dN/dt = rN(1−N/K). (9.8)

This equation, while trivially simple, captures the essence of replication. Here r represents the usual exponential growth rate of replicated structures and K represents the carrying capacity of the system. In the homeostatic state replicas may die and be born (dN/dt = 0), and the system is maintained in a homeostatic state by the flow through of nutrients and energy and the ejection of waste. For some simple physical processes this waste can be ejected as latent heat, as happens when the vibrational states of the system lose heat to the environment. For example, when spins become aligned in a magnetic material, disorder is shifted to the vibrational states (i.e. the temperature rises). The vibrational states, which become more disordered, capture the history or the memory of the computational process. Unless the corresponding heat can be removed, the system remains reversible and the spins will in time revert back to a more random configuration. Alternatively the waste might be relatively less ordered photons, or degraded chemical products that exit the system. The rate at which the system can access resources and eject waste determines where this homeostatic balance occurs. This is the situation where bits remaining in the system are conserved. The replicated structures create ordered structures and shift compensating disorder elsewhere to be ejected. In essence, replication processes use natural laws to self-regulate to maintain the system far from equilibrium even though the replicated structures degrade under the second law of thermodynamics. Simple physical replication processes can be used to illustrate the key characteristics of natural ordering through replication.
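The approach to the carrying capacity described above can be visualised by integrating Equation 9.8 numerically. The Python sketch below uses a simple Euler step with assumed values of r and K; it is intended only to show N rising towards the carrying capacity, where dN/dt ≈ 0 and the homeostatic balance is reached.

# Euler integration of dN/dt = r*N*(1 - N/K); parameter values are assumptions.
r = 0.5        # growth rate per unit time
K = 1000.0     # carrying capacity
N = 1.0        # start from a single replicated structure
dt = 0.01

for step in range(4000):
    N += r * N * (1.0 - N / K) * dt

print("N after integration: %.1f (carrying capacity K = %.0f)" % (N, K))
# N approaches K, the homeostatic state where replicas die and are born at equal rates.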

Many living systems also need to do work, increasing the thermodynamic cost of maintaining the system away from equilibrium. In order to fly, birds do work at a greater rate than mammals or reptiles. As a consequence, their body temperature is higher. While systems further from equilibrium have a greater capacity to do work, they need to be more efficient in their use of the resources that are inputs to the computational process and in ejecting disorder represented by the output string that exits the system. As is discussed later, evolutionary type processes acting on a system of replicated structures select variations that are more efficient in this sense.

9.6 Simple illustrative examples of replication

Replicating spins

In contrast to the photodiode system or the hydrogen and oxygen system discussed above, replication processes are naturally ordering processes.



While many replicating processes occur in living systems, much can be learned about the process by considering two simple physical examples - the alignment of magnetic spins (i.e. magnetic moments) below the Curie temperature and the replication of photons by stimulated emission. In the second example, stimulated emission provides ordered repeats of a structure by passing disorder to the electronic states of the atoms and then to thermal degrees of freedom. These examples are discussed in some detail below to clarify the entropy flows in an ordering process.

For those unfamiliar with magnetic systems, in what follows the word "spin" refers to the magnetic moment of a spinning electron. One can envisage this as a micro magnet associated with a magnetic atom such as an iron atom. These micro magnets might all be aligned, in which case one says "the spins are aligned", etc. Above the Curie temperature an aligned system of spins will become more disordered as thermal energy passes into the spin system, disordering it. If the disordered system is isolated, it will settle in an equilibrium configuration where the spins are only partially aligned. It is only when the temperature falls below the Curie temperature, with the ejection of latent heat, that full spin alignment occurs, as ferromagnetic coupling between spins will dominate the thermal energy. Order has increased by discarding heat to an external colder environment. The ideas can be illustrated by a simple model - the ordering of a 1-D chain of spins. This model not only illustrates replication, but is also informative about what happens to the information during the ordering process.

Let the configuration of the partially coupled 1-D spin chain be specified by a string of N disordered or randomly aligned spins denoted by x, where x is either 1, representing a spin up, or 0, representing a spin down. Let the vibrational states be below the Curie temperature and let the representative string be approximated by a string of zeros. As the vibrational states are at a low temperature there is an entropy gradient between the spin subsystem and the vibrational subsystem. The ordered vibrational configuration can be represented schematically by an ordered string of N zeros. The initial configuration i consisting of a random set of spins and the ordered vibrational states is then represented by,

i = 1xxx . . . xx000000 . . . .

The first character is a '1' and represents the particular spin that will be replicated in the sense that, under the right conditions, magnetic forces between spins will ensure the next spin is aligned with the first, and so on. From a computational perspective, the programme scans the string representing the spins, finds a 1, and effectively replicates the alignment by ensuring the next spin is also a 1. Spin alignment is a spontaneous symmetry breaking process. For the ordering to continue, information (i.e. bits) carried by the programme states must be transferred to the 00 . . . 00 vibrational states, disordering the vibrational degrees of freedom. This effectively raises the temperature of the vibrational degrees of freedom. While the system is isolated no entropy is lost and the vibrational states retain the memory of the physical process, maintaining reversibility.

If U1 is taken to be the computation that represents the real world process of aligning the next spin with the current spin, the overall ordering process can be envisaged as a repeat of this alignment process on the initial string i down the spin chain. In which case, the output after N steps is:

(U1)^N i = o = 11111 . . . 11xxxxxx . . . xx. (9.9)

As was mentioned, the randomness has been transferred to what previously was the ordered section of the string representing the vibrational coordinates. If the system is below the Curie temperature after the energy transfer, and remains isolated, the system settles in a stable set of aligned spins. Strictly, each aligned configuration is one of an equilibrium set of similar configurations, with the odd rare fluctuation that mis-aligns spins. If isolation is removed and the temperature rises, in order to re-order the system, disorder as heat must be ejected and passed to a low temperature external sink, but at the cost of increasing the entropy of the rest of the universe.

For large N, ignoring computing overheads, the algorithmic entropy H(i) of the input string is given by;

H(i) ≈ (N + log2N)   (specifying the random 1xxxx..x)
     + log2N + |0|   (specifying 0, N times). (9.10)

Here the random section of length N is specified by N + log2N bits, with the log term partially allowing for self-delimiting coding. The output string can be specified by a similar algorithm, but where this time the spins are ordered and the vibrational states disordered. I.e. the algorithmic entropy becomes

Halgo(o) ≈ log2N + |1|   (aligned spins)
         + N + log2N + |0|   (disordered vibrational states). (9.11)

At this stage, as nothing is lost from the system, reversibility is implicit in the history of the computation embodied in the final state. Hence Halgo(o) = Halgo(i). Once the vibrational states lose heat to the external environment the algorithmic entropy is lowered and the output becomes;

o = 11111 . . . 1100 . . . 00.

The algorithm that shifts i to o is implicit in the bit settings of the state i, as these carry the algorithm in the physical laws that are embodied in the generalised gates of the real world system. I.e. the instructions (U1)^N are implicit in the original state.
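A toy computational version of this spin chain can be written in Python under the simplifying assumptions above. The sketch below treats the spin string and the vibrational string as the two halves of a single configuration: aligning each spin with its predecessor pushes the displaced bit into the vibrational string, so the disorder is transferred rather than destroyed and each step remains logically reversible while the system is isolated.

import random

N = 16
# Initial configuration i: one seed spin '1', N-1 random spins, and N ordered
# vibrational bits, mirroring i = 1xxx...xx000...0 in the text.
spins = [1] + [random.randint(0, 1) for _ in range(N - 1)]
vibrations = [0] * N

def align_next(spins, vibrations, k):
    # Align spin k+1 with spin k; if a flip is needed, record it in the vibrational states.
    # The step is reversible: the original spin k+1 is the new value flipped whenever
    # vibrations[k+1] == 1.
    if spins[k + 1] != spins[k]:
        vibrations[k + 1] = 1
        spins[k + 1] = spins[k]

for k in range(N - 1):          # repeat the alignment down the chain, as in Equation 9.9
    align_next(spins, vibrations, k)

print("spins      :", "".join(map(str, spins)))        # now all aligned: 111...1
print("vibrations :", "".join(map(str, vibrations)))   # now carries the displaced disorder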


Coherent photons as replicators

Another simple example is the replication of coherent photons through stimulated emission. Consider a physical system of N identical two-state atoms where a photon is emitted when the excited state returns to the ground state [46]. The emitted photons can either be in a coherent state, or a set of incoherent states. The state space of the system initially consists of;

• the position and momentum states of the atoms;

• the momentum states of the photons;

• the excited and ground electronic states of each atom.

Assume the system of photons and particles is contained within end mirrors and walls, and that no energy loss occurs through collisions with the container. The initial ordered configuration will be taken to be the N atoms existing in their excited states with the photon states empty. If the lifetime of the excited state of the atom is short, only incoherent photons will be emitted, and the system will settle in an equilibrium consisting of excited and ground state atomic states. These incoherent photons, with momentum vectors scattered over all directions, will not be significantly ordered. In addition, under second law processes, some excited atoms will return to the ground state via collisions or other processes before emitting a photon, in so doing increasing the thermal energy of the system. Ultimately the system will settle in a local equilibrium at a higher temperature with most atoms in the ground state.

On the other hand, if the lifetime of the excited states is long, stimulated emission of photons becomes possible. This can be envisaged as a replication process involving in the first instance a single photon state. Once a stray photon replicates, photons will continue to replicate until there is a balance between the number of excited atoms, the number of coherent photons and the very small number of incoherent photon states. A symmetry breaking process takes place as the stimulated photons have aligned momentum vectors, rather than being randomly oriented over all directions. While the system is isolated, each photon replication is reversible and a coherent photon may be absorbed, returning an atom to its excited state. Photon replicas die and are born as the system settles in an attractor-like region of its state space. The physical computation process for an isolated system is both thermodynamically and logically reversible.

The algorithmic description of an instantaneous configuration of this isolated system after a period of time will consist of the description of the atomic states; a description of the state of each incoherent photon; the description of the coherent photon state (consisting of replicas of the original stimulating photon); and the momentum states of the atoms, which are deemed to be random. Let the electronic states of the system be represented as a string where a '1' represents an excited atom and a '0' represents the ground state. The input string i representing the initial highly ordered system of excited atoms with no photons takes the schematic form:


[Atom electronic states] [Photon states]
i = [1111111111111111111] [empty].

We will assume the system is kept at a constant temperature, as this allows the momentum or vibrational states of the atoms, and the atomic positions, to be ignored, since only the change in algorithmic entropy is relevant.

Initially all the atoms are in their excited state with no photons. If stimulated emission is possible and if the system is isolated, after a period of time a photon will be emitted and this will stimulate the emission of other coherent photons. Similarly incoherent photons may also be emitted. Over time the system will consist of a mixture of atomic states and n photons in a coherent state. The algorithmic description of an instantaneous configuration of the relevant total system will consist of the description of the atomic states in terms of whether they are excited or in the ground state, and the description of the coherent photon state which consists of copies of the original stimulating photon. I.e.

[Atom electronic states] [Coherent photon state]
o = [0011111011011010110] [n].

The n coherent photon state is more ordered than the incoherent states as the momentum vector is no longer randomly oriented. The memory of the photon creation process needed to maintain reversibility passes to the momentum states of the atoms. In practice this memory disappears when heat is exchanged with the environment. If photons do not replicate, the system will tend to an equilibrium as stored energy is passed to the momentum states of the atoms, analogous to a free expansion. However, photon replication generates order by producing the coherent photon state once entropy is passed to the environment.

Nevertheless, if photons can escape, the system will need to access an external energy source to re-excite some of the atomic states to compensate for the photon loss. The external source that excites the atoms could be an electrical discharge or a flash lamp. These examples, while seemingly trivial, capture some of the key conceptual issues in understanding ordering from an AIT perspective. If the physical laws are embodied in programme p that runs on UTM U, the process can be envisaged as U(p, i) = o. As the above examples illustrate, in general a replication process creates order by passing the disorder elsewhere; effectively partially randomizing other degrees of freedom. If the system is isolated for long periods, the system is reversible and will settle in a region of state space where replicated structures die and are born as the total system moves through possible configurations. However, as no system can truly be isolated, eventually the order will be destroyed by losing photons and heat to depopulate the excited atomic levels. In which case the system can be maintained in an ordered configuration by exciting the atomic levels with a flash lamp or a gas discharge and by removing excess heat.

The process of maintaining the system in such a configuration is discussed in more detail in sections 6.2 and 9.4.
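A crude stochastic caricature of this photon replication can be written in Python. The sketch below makes obvious simplifying assumptions (no incoherent emission, no losses, rate constants chosen arbitrarily); its only purpose is to show the chain-reaction character of stimulated emission and the system settling into an attractor-like balance where coherent photons are absorbed and re-emitted.

import random

N_atoms = 1000          # two-state atoms, all initially excited (assumed)
excited = N_atoms
photons = 1             # a single stray photon seeds the replication
emit_rate = 0.01        # assumed chance per step that a photon stimulates an emission
absorb_rate = 0.005     # assumed chance per step that a photon is re-absorbed

for step in range(5000):
    # Stimulated emission: the more coherent photons, the more likely further emission.
    emissions = sum(1 for _ in range(photons)
                    if random.random() < emit_rate * excited / N_atoms)
    # Absorption: a coherent photon may return an atom to its excited state (reversibility).
    absorptions = sum(1 for _ in range(photons) if random.random() < absorb_rate)
    emissions = min(emissions, excited)
    absorptions = min(absorptions, N_atoms - (excited - emissions))
    excited += absorptions - emissions
    photons += emissions - absorptions

print("excited atoms:", excited, " coherent photons:", photons)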


9.7 The entropy cost of replication with variations

The previous sections considered the situation of identical replicated structures. However, many real world systems consist of near identical structures. For example the expression of a cell in the eye is not identical to one in the liver, yet both contain the same DNA coding. Depending on the source of the variation in the replicated structures, variation introduces an entropy cost that can be quantified by the Algorithmic Information Theory approach. In the case of near identical replicated structures, the provisional algorithmic entropy identifies an entropy increase with the uncertainty arising from the variations. As a consequence, the algorithmic description is longer.

A simple example is the string formed by replicating the structure '10' to produce the string 101010...10. Variation can be introduced by allowing both '10' and '11' to be considered as valid variants of the core replicated structure. This string then becomes the noisy period-2 string discussed in section 4.4, where every second digit is random. Thus a valid system of replicated structures would now be any output string of length N having the form o = 1y1y1y1y..1y, where y is 0 or 1. There are Ωs = 2^(N/2) members of the set and, as section 4.4 shows, the provisional entropy of the variants is;

Hprov(o) ≃ N/2 + log2(N/2) + |1|+ |0|, (9.12)

ignoring higher "loglog" terms. In general, the provisional algorithmic entropy, which is given in section 4.1, includes the length of the code that identifies the string within the set of 2^(N/2) variants, together with the specification of the pattern, or the model, that identifies the set itself.

By comparison, the provisional algorithmic entropy of the string o = 101010...10, a string which represents no variations in the replicated unit (i.e. 10 not 1y), is given by;

Hprov(o) = log2(N/2) + |10|. (9.13)

The increase in the algorithmic entropy from log2(N/2) to N/2 + log2(N/2) is the increase in the uncertainty (i.e. the Shannon entropy) that arises by allowing variants of the original structure.
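The comparison between Equations 9.12 and 9.13 can be made concrete with a few lines of Python. In the sketch below the fixed overheads |1|, |0| and |10| are ignored (an assumption made purely for simplicity), leaving the dominant terms and showing the roughly N/2 bit cost incurred by allowing the second digit of each pair to vary.

import math

N = 1024   # length of the output string, chosen for illustration

# Equation 9.12: noisy period-2 string 1y1y...1y with y random (fixed terms ignored)
H_noisy = N / 2 + math.log2(N / 2)

# Equation 9.13: exact repetition 1010...10 (fixed terms ignored)
H_exact = math.log2(N / 2)

print("with variation         : ~%.0f bits" % H_noisy)
print("identical replicas     : ~%.0f bits" % H_exact)
print("entropy cost, about N/2: ~%.0f bits" % (H_noisy - H_exact))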

The general case

Consider a set of M variations of a replica given by r0, r1, r2, . . . , rM−1. If, say, M = 5, a string of length L representing the Ith particular arrangement of replicas would be something like

SL(I) = r4, r4, r2, r4, r1, . . . r0

The entropy increase due to variation can be established by recognising that there are M^L members of the set of strings of the same form as SL. In order to specify a particular string in the set of replicas, a code of length |M^L| is needed.


For self-delimiting coding, and ignoring higher order terms, the particular Ith string requires L(log2M) + |L| + |log2M| bits. The length of the algorithm that specifies the characteristics of the set of replicas requires |criteria of replica set| bits. I.e. one can step through all the strings SL(I) that satisfy the replica requirements until the particular code is reached to define a particular string. The length of the total code is;

H(SL) ≃ L(log2M) + |L| + |log2M| + |criteria of replica set|. (9.14)

For the particular case where all replicas r are identical, further pattern exists, and in this case, as M = 1, the L(log2M) and |log2M| terms are eliminated. Equation 9.14 then becomes, as one would expect;

Hprov(identical replicate string) ≃ |L|+ |r|. (9.15)

This is equivalent to the period-2 string in equation 4.10 with |L| = log2L = log2(N/2), while |r| = |10| specifies the characteristic of the particular structure to be replicated. If N is large relative to the pattern description in equation 9.14, it is seen that the variation leads to an increase in entropy of L(log2M) + |log2M| over the situation where all replicas are identical. However this is an upper limit as it is based on the assumption that the arrangement of variations is random. This is seldom so in living species where, for example, all the variations corresponding to an eye cell occur in clusters. For this reason, the actual increase in provisional entropy where variations occur with pattern or structure is likely to be much smaller. I.e. when internal triggers determine which genes are to be expressed, the information capturing this is already embedded in the growth algorithm. In essence, the provisional entropy of living beings built from variations of a core replicated structure may not be significantly different from a structure with identical replicated units.
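The counting in Equations 9.14 and 9.15 can likewise be sketched numerically. In the Python fragment below the terms |L|, |log2M| and |criteria of replica set| are lumped into a single assumed constant, so only the dominant L·log2M term is tracked; it grows with the number of allowed variants M and vanishes for identical replicas (M = 1), recovering the behaviour of Equation 9.15.

import math

def H_variant_string(L, M, set_description=30):
    # Approximate Equation 9.14: bits to pick one arrangement of L replicas drawn from
    # M variants.  'set_description' stands in for |L| + |log2 M| + |criteria of replica set|
    # (an assumed lump value).  For M = 1 the first term vanishes, as in Equation 9.15.
    return L * math.log2(M) + set_description

L = 500
for M in (1, 2, 5, 20):
    print("M = %2d variants -> ~%6.0f bits" % (M, H_variant_string(L, M)))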

Natural ordering through replication processes

In contrast to an inert system that cannot maintain itself, equation 9.8 shows that replication processes can naturally generate and re-generate ordered structures with low algorithmic entropy, by ejecting disorder and, if necessary, by accessing high quality concentrated energy such as photons. The high quality energy repackages the system, separating the heat and degraded species from the ordered regions, allowing natural processes to eject the waste. A simple illustration is the growth of bacteria in a flow of nutrients, which eventually settles in a stable situation provided that waste products can be eliminated. Equation 9.8 is a schematic representation of the process which drives the system towards a stable set of configurations, as the bacteria will replicate until their numbers reach the carrying capacity of the system. The carrying capacity depends on the rate at which the nutrients are replenished and the rate at which bacteria die, producing the waste products that need to be ejected. In order to understand the computational processes that reflect this replication process, equation 6.2 can be modified to allow a replication algorithm to embed many micro steps into a larger replicating subroutine.

Here the parameter N will define the number of times the replication routine is called to append a further replica to the string "STATE" before finally halting. While the replication process is a parallel computing process, it can be written in a schematic linear form. At each cycle of the FOR loop, the replication routine is called to append a further replica to the string "STATE" and, at the same time, resources are depleted. The number of replicated structures, N, doubles on each pass, and the structure grows until the resources fall below the minimum value needed to continue replication. I.e. when "RESOURCES < MINIMUM" the algorithm halts.

STATE = r
N = 1
FOR I = 1 to N
    ACCESS NEW RESOURCES
    IF RESOURCES < MINIMUM GO TO HALT
    REPLICATE r
    DEPLETE RESOURCES
    STATE = STATE, r
    EJECT WASTE
NEXT I
N = N + N
GO TO FOR
HALT                                                        (9.16)

If Nmax is the final value of N, the final output is STATE = r, r, r, r . . . r with Nmax repeats of the basic unit r. The replication process is reversible if nothing escapes the system.
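For readers who prefer a runnable version, the schematic algorithm 9.16 can be mimicked in Python. The sketch below is only a caricature, with assumed resource parameters: the number of replicas doubles on each outer pass, each replication consumes resources, the computational history is simply discarded (standing in for EJECT WASTE), and the process halts when resources fall below the minimum needed to continue.

def replicate_until_exhausted(resources=10_000, minimum=10,
                              cost_per_replica=1, replenish=0):
    # Caricature of algorithm 9.16; all parameter values are illustrative assumptions.
    state = ["r"]                        # STATE = r
    n = 1                                # N = 1
    while True:                          # GO TO FOR
        for _ in range(n):               # FOR I = 1 to N
            resources += replenish       # ACCESS NEW RESOURCES
            if resources < minimum:      # IF RESOURCES < MINIMUM GO TO HALT
                return state, resources
            resources -= cost_per_replica    # REPLICATE r, DEPLETE RESOURCES
            state.append("r")            # STATE = STATE, r
            # EJECT WASTE: the history of the step is simply not retained here
        n += n                           # N = N + N

state, remaining = replicate_until_exhausted()
print("replicas produced:", len(state), " resources left:", remaining)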

The following examples are simple models to help envisage the computational processes involved in replication.

• The alignment of magnetic spins or the formation of a block of ice from water are simple examples of a replication process. In the magnetic case, above the Curie temperature (see section 9.6), the spins in a magnetic material are disordered, as the thermal energy overrides the magnetic coupling between spins. The crystallisation of a block of ice is similar. In the magnetic case, if the temperature of the system lowers, but is still above the Curie transition temperature, it will settle in an equilibrium configuration where some, but not all, spins are aligned. In the case of ice, full ordering only occurs below the freezing temperature. It is only when the ejection of heat drives the temperature of the system below the Curie temperature, or in the case of ice below the freezing temperature, that a complete phase transition occurs. In the low temperature magnetic case, once a few spins align, others will do the same as coupling between spins drives the replication process. The alignment of spins is a replication process that creates order once disorder, as latent heat, is passed to an external cooler environment. The particular configuration of the ordered magnetised spin system (or a block of ice) could be defined by specifying a unit cell of the structure and using a copy algorithm to translate a copy of the unit cell to each position in the lattice to specify the whole. However, because the vibrational states are the means by which heat is transferred to the environment, the specification of the instantaneous configuration of the vibrational and other states needs to be included in the description to properly track entropy flows. This can be seen with the particular example of a crystal of ice. If each water molecule in the ice is denoted by r, the ice structure of N molecules is represented by a string r1, r2 . . . rN where the label denotes the position of each molecule in the structure. However the vibrational states that are coupled with these also need to be specified. As a result, the full specification becomes r1, r2 . . . rN, φi(T). Here φi(T) represents an instantaneous specification of the vibrational, rotational, electronic states and other degrees of freedom for the whole system at temperature T.

In contrast to using a copying algorithm, one can specify the system by a replicating algorithm that implements physical laws to grow the ordered system step by step. As these states emerge through the replicating processes, they do not need to be specified separately. At each step, energy is passed from the structure to the vibrational states and then to the external environment. In this case, given the initial vibrational states, the final momentum states of the system require heat to be passed to the environment for the replication process to continue. If one is interested in comparing the algorithmic entropies of two different physical states, e.g. water and ice, where the laws of nature can be taken as given, as outlined in section 6.3, the copying algorithm is only likely to be shorter than the replicating algorithm if one chooses to ignore the vibrational and other relevant states that are strictly part of the system.

• Another example that models typical behaviour is a system of identical two-state atoms in a gas where stimulated emission in effect orders the photons emitted from the excited atoms (see section 9.6) to form coherent photons. In contrast to the crystallisation case, stimulated emission requires the input of high quality energy embodied in the excited atomic states to feed the replication process, as well as the elimination of heat. A symmetry breaking process takes place as the stimulated photons have aligned momentum vectors, rather than being randomly oriented. Stimulated emission can be seen as a photon replication process. The macrostate, consisting of the coherent photon state, a very small number of incoherent photons and most atoms in their ground state, can be sustained by ejecting heat and accessing energy to populate the atomic states. In this situation the photon replicas die and are born as the system settles in an attractor. The string that represents the microstate of the system of coherent photons needs to specify the coherent photon state, the incoherent photon states and the atomic energy levels which interchange energy with the photons as the system moves from one microstate to another, all within the same macrostate. Other states of the system, such as the momentum states, can be represented by φi(T), noting that, as coherent photons are replenished, the energy that enters the system is passed to the momentum states and then to the environment, maintaining a constant temperature.

Section 6.3 argues that a replication algorithm is the most efficient algorithm to define a structure such as a tree if all degrees of freedom are accounted for. The replicating algorithm that generates the tree from the genetic code in the seed does not require the particular variation to be determined for each cell position in the tree. The algorithm is efficient as it uses the same genetic code in the DNA, but calls different routines depending on the environment of neighbouring cells, and external inputs such as water, light and gravity, to determine which actual variation is expressed as part of the development process. The environment of neighbouring cells in effect shows how the history of the process determines the next state. Consequently a cell in the leaf is expressed differently from a cell in, say, the root system. Nevertheless, if the full number of replicated structures is to emerge, entropy as heat or waste must ultimately be passed to the environment. In principle a growing tree is a reversible system but, throughout the growth process, waste exits the system. As the waste drifts away, reversibility becomes highly unlikely in the life of the universe. With these understandings it becomes possible to track algorithmic entropy flows, and therefore thermodynamic entropy flows, in real systems.

Replicating algorithms

Simple replication processes, such as the crystallisation of ice or the alignment of spins, occur just by ejecting excess entropy, whereas most replicating and ordering processes also require the input of high quality or concentrated energy in addition to ejecting waste. Equation 9.7 shows that replicating systems have a lower algorithmic entropy than a system of non-identical structures. An algorithm that generates the repeated structure is similar to the discrete form of the logistic growth equation 9.8, i.e. the discrete equation below, where j represents the jth computational cycle.

Nj+1 = Nj + rNj(1−Nj/K), (9.17)

This equation maps the process of generating replicated structures until N reaches the carrying capacity K. In which case, the algorithmic entropy might be expected to be

Halgo(system) = |N| + |generating algorithm|.

While this is only a simple example of a replicating system, any replicating algorithm operating under physical laws will have the same basic form.

In the general case, the replication process imitating the discrete form of the growth equation 9.17 can be seen as a computation that acts on an input string σ (= σ1, σ2 . . . σN) representing the resource string embodying concentrated energy, and also the structure r1 which is to be replicated. The input can be represented by i = r1σ, while the output is the system of replicated structures together with a waste string. If the replication cycle loops N times, N replicated structures will emerge.

U(pN , r1, σ) = r1r2r3 . . . rN , w(T ′). (9.18)

In this process the number of replicated structures r1, r2, etc. increases, where the subscripts identify successive structures. The string w(T′) is the history or waste string that ultimately must be ejected from the system to inhibit reversibility, and to allow the ordered structure to emerge as the system increases its temperature from T to T′. Other states that contribute to the algorithmic entropy of the replicated system, originally at temperature T, are essentially unchanged at the end of the process of replication. Originally these states can be represented by φi(T); they become φj(T′) when the temperature rises, changing the vibrational states. As the waste string w(T′) is ejected, φj(T′) separates into φk(T) + w(T′), leaving φk(T) behind. Because the before and after contributions to the provisional entropy of these states are unchanged, i.e. |φk(T)| = |φi(T)|, the states can be ignored. Overall, with the ejection of waste, the provisional entropy is lower and the structure is more ordered.

The following argument tracks the degradation process and the restoration process from both a computational and a phenomenological point of view. We assume the system is initially in a stable viable configuration denoted by ǫi that specifies the N repeated structures and also the other degrees of freedom associated with the system. The set of stable configurations all have the same provisional entropy. This implies the number of replicated structures is fixed; nevertheless, there are many different ways the thermal energy can be distributed amongst the other degrees of freedom, such as the vibrational states of the system. The overwhelming majority of possible configurations will have the same number of replicated structures.

The second law degradation process is given by equation 9.2 i.e.

U(p∗2ndlaw, ǫi) = ηj .

This shifts the system to the non-viable configuration ηj. From the phenomenological perspective, the replicated structures are destroyed at a rate αN that depends on the number present.


The regulation computation given by equation 9.4 accesses the resource string σ to regenerate the structure. The rate at which this can happen is usually limited by the rate at which the resource structures enter the system. This is taken to be β structures per unit time. From the computational point of view the non-viable configuration is ηj = r1, r2, . . . rN−M, w(T′), where M replicated structures have been destroyed, or have decayed, creating the waste string w(T′) that captures the history of the process. The corresponding restoration programme preg,M that loops M times to regenerate the M structures and ejects the waste w(T′) is

U(preg,M, ηj, σ1, σ2 . . . σM) = r1, r2, . . . rN, w(T′).

The overall waste ejected includes the bits causing the replicas to decay; the bits in the regulation algorithm preg,M that restores the number of replicated structures from N − M to N; together with the bits specifying the extra resources that have been added. I.e. all extra inputs need to be ejected to stop the build up of waste in the system.

While the growth characteristics align with Equation 9.17, the logistic growth equation does not adequately track the resource flows for the replication process. A more realistic approach that identifies the replication dependence on resource availability is the Holling type-II functional response, which is similar to the Monod function or the Michaelis-Menten equation. This equation form can be used to describe the interaction of a predator and its prey (which in this case is the replicator-resource interaction). As mentioned above, the equations below include a term αN in the growth equation to capture the decay or destruction of replicated structures due to second law effects. The assumption is that the removal of structures (i.e. their death) at any time is proportional to their number, with proportionality constant α. Also the constant flow through of the resources β needed to drive the replication process is included. While a further assumption is that all the computational waste material can be ejected, as is mentioned later, the inability to eject waste efficiently can limit the replication process. With this understanding, let N represent the number of replicated structures and let σ represent the number of resource units that are required to feed the replication process. In which case,

dN/dt = rσN/(b + σ) − αN

dσ/dt = −rσN/(γ(b + σ)) + β.                                (9.19)

The parameter γ represents the number of replicated structures produced from the consumption of one resource unit, i.e. −γ∆σ ≈ ∆N. The maximum growth rate is the initial growth rate r when γσ >> N, whereas b is the resource level at which the growth rate is halved, i.e. when σ = b. The conditions for a stable long term homeostatic state occur when the two derivatives approach zero. In which case the number of replicated structures increases, tending to a long term value N∞ = γβ/α = K, where K can be identified as the carrying capacity of the system.

Equations 9.2 and 9.18 are the computational equivalent of the replication process captured by Equation 9.19. In each unit of time β resources flow into the system, maintaining the length of the resource string at σ. In the stable configuration, the resources are just sufficient to replace the αN replicated structures that decay over that period. From a computational point of view, the long term description of the replicated structures becomes r1r2r3 . . . rK with K = N. The length of the algorithmic description of the string of replicated structures produced by a copying routine would be close to |r1r2r3 . . . rK| = |r∗1| + log2K. I.e. one replicated structure needs to be specified by r∗1 and the specification repeated K times. Here the asterisk indicates this is the shortest description of the replica. On the other hand, if the length is derived from the replicating algorithm rather than the copying algorithm, the dominant contribution would come from |r∗1| + log2N. The optimum number of replicated structures will emerge when N = K and log2N = log2K.
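The long-term behaviour claimed for Equation 9.19 can be checked with a simple Euler integration. The Python sketch below uses assumed values for r, b, α, β and γ, and shows the number of replicated structures settling towards the predicted carrying capacity K = γβ/α while the resource level settles to the value at which growth just balances decay.

# Euler integration of the replicator-resource equations (Equation 9.19).
# Parameter values are illustrative assumptions only.
r, b, alpha, beta, gamma = 1.0, 50.0, 0.1, 5.0, 2.0
N, sigma = 1.0, 100.0          # initial replicas and resource units
dt = 0.01

for _ in range(200_000):
    dN = r * sigma * N / (b + sigma) - alpha * N
    dsigma = -r * sigma * N / (gamma * (b + sigma)) + beta
    N += dN * dt
    sigma += dsigma * dt

print("N     -> %.1f (predicted K = gamma*beta/alpha = %.1f)" % (N, gamma * beta / alpha))
print("sigma -> %.1f" % sigma)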

Replicated structures can be structurally slightly different if the inputs to the replicating algorithm vary. For example an eye cell is generated by the same algorithm as a heart cell, but different genes are expressed because the specific environment surrounding the eye cell provides different inputs to the generating routine. Flexibility in function is achieved with little increase in algorithmic entropy as the variation occurs when external triggers, or triggers from neighbouring replicas, call different routines.

Replicating systems act as pumps to eject entropy to form the ordered state. Entropy ejection is efficient in reducing entropy gradients and degrading high quality energy. When, for example, water crystallises through a replication process, rather than cooling slowly, a spike of latent heat is generated which passes to the environment more rapidly because of the increased temperature differential between the system and the environment. Similarly, when high quality energy is converted to waste species in the replication process, the ejected waste is more efficiently transferred to the environment compared with other processes. This is consistent with Schneider and Kay [92] who argue that the emergence of biologically complex structures is a move from equilibrium. As natural systems resist any such movement, living systems are more effective than inert structures in dissipating energy, by degrading high quality energy and in countering entropy gradients. Schneider and Kay show that evaporation and transpiration of the constituents of an ecology are the major forms of energy and entropy dissipation. From the perspective here, when species die and the stored energy is reprocessed and further degraded by species lower in the food chain, further evaporation and transpiration take place. Here, one would see that replication is the driver of more efficient dissipation at the system level.

While real world replication is a complex parallel computing process, the approach described above captures its essentials. Provided the creation rate of replicas exceeds their decay rate, the number of replicas will grow until the system settles in a stable set of configurations. In this homeostatic situation, the carrying capacity of the replicated structures is constrained either by the rate at which the waste can exit the system, or by the rate at which resource inputs become available to replace decayed replicas. Many living systems also need to do work, increasing the thermodynamic cost of maintaining the system away from the global equilibrium. While systems further from equilibrium have a greater capacity to do work, they need to be more efficient in their use of the resources that are inputs to the computational process. They also need to be more efficient in ejecting disorder which, from a computational point of view, is the output string that exits the system. As is discussed below, evolutionary type processes acting on a system of replicated structures select variations that are more efficient in this sense.

9.8 Selection processes to maintain a system in a viable configuration

Genetic diversity, arising from random mutations, copying errors and crossovers, coupled with biological selection processes, is key to the emergence of structures better adapted to the environment. This section looks at selection processes from a computational point of view, showing that diversity in the replicated structures, while better maintaining the system distant from equilibrium in an adverse environment, also drives the emergence of different structural types. It is argued that not only can a system of replicated structures counter second law degradation processes by replacing lost species, but, if there is sufficient variety among near identical replicas, adaptation processes can also counter more virulent disturbances impacting on the system. For this to happen, variations in the replicated structure must emerge at a rate that is fast relative to the external environmental changes. This adaptation can be viewed as a form of regulation as, in general, organised systems consisting of variations of a simple replica are more stable against change. They have an increased ability to maintain themselves in a far-from-equilibrium situation as the set of viable configurations is larger. However it is a viable set that expands its horizons as diversity increases over time, primarily through changes to the genetic code by copying errors, gene transfer, and more general mutations. Changes in the genetic information are equivalent to small modifications of the algorithm that generates the replicated structure, thus increasing diversity. New resources flowing into an open system of replicas are in effect additions to the input string, expanding its state space. Similarly, resources flowing out of the system lead to a loss of information and a contraction of the state space.

Consider the replicator-resource dynamics where a disturbance threatens the viability of the system of replicated structures. The disturbance, which could involve an inflow or outflow of species or energy, will vary the computational paths undertaken by the system. Some variations of the replica may become less likely, while others become more likely in a changing environment. Those best adapted will dominate the replica mix. Examples might be where one variant of a tree species is better able to cope with drought, or the emergence of drug resistant bacteria. The phrase “best adapted” in these cases generally means that, from an entropy flow point of view, the system is more efficient in using resources, in the sense that the system is easier to maintain far from equilibrium. This resource efficiency will be termed “entropic efficiency”. In essence, sufficient diversity in a system of replicated structures increases the ability of the system to survive in an adverse environment. Nevertheless, it needs to be recognised that selection processes reduce diversity in the short term, as a system with fewer variations has a lower algorithmic entropy. But as, in a living system, diversity inevitably reappears over time due to genetic mutations and genetic recombination, one can argue that these adaptive processes determine a long term trajectory leading to the biological evolution of the system. These adaptive processes can be considered a form of quasi-regulation, as they tend to maintain the system within the viable set of configurations.

The insights of Eigen and his co-workers [50, 51] on the emergence and growth of self-replicating polynucleotides can be applied to more general replicating systems. In the polynucleotide case, copying errors provide variations in the genetic code, some of which, through selection processes, come to dominate the molecular soup. Interestingly, Eigen shows that a distribution of closely interrelated polynucleotides, which he terms a “quasi-species”, dominates the mix rather than one particular variant. New structures can emerge if there is sufficient diversity among near identical replicas, as adaptation processes can counter more virulent disturbances impacting on the system.

A simple model, based on predator-prey dynamics (or in this case the equivalent replicator-resource dynamics), provides useful insights into how variation and selection processes enable a system to adapt when two variations of a replicating structure compete for limited resources. As perhaps is no surprise, just as for the single replica case, a simple model where two variants compete for limited resources shows that the variant with the highest carrying capacity will dominate the mix. As is shown below, the carrying capacity depends on the efficiency of the replication process, in the sense of efficient use of resources, and also on the variant's capability to withstand the onslaught of second law processes. The argument could be articulated in terms of the interactions between two subspecies in a competitive environment using coupled versions of equation 9.8. This approach, which gives rise to the Lotka-Volterra equations, shows that Gause's law of competitive exclusion applies and the variant with the greatest carrying capacity is the one that survives [61]. However in this case, the carrying capacity is not an external parameter but emerges from how efficiently the replication processes use resources and how well the variant withstands second law decay. Consequently, coupled versions of the Lotka-Volterra equations are inappropriate, and Eigen's approach [50] is more realistic. From a computational point of view, where living systems compete for resources, the biological selection processes that optimise the system's use of resources operate at the genetic level. This reduces the information input needed to define the structure and establishes the requirements for a stable equilibrium, where there is a constant flux of monomers corresponding to the resources that feed the process. Eigen's results are essentially the same as those of the following simple discussion, which is based on the simple predator-prey model addressed in section 9.7. In the approach here there are two variants of the basic replica, with numbers N1 and N2, competing for the resources σ that feed the replication process. The first two of the following equations capture the growth rate of the two variants, given their respective decay rates α1 and α2 and the additional parameters c1 and c2 that represent the impact of one variant on the other. The last equation captures the rate at which the resources change, given the resource consumption by the variants and the constant inflow β of new resources.

\begin{align*}
dN_1/dt &= \frac{r_1\sigma N_1}{b_1 + \sigma} - \alpha_1 N_1 - c_1 N_1 N_2,\\
dN_2/dt &= \frac{r_2\sigma N_2}{b_2 + \sigma} - \alpha_2 N_2 - c_2 N_1 N_2,\\
d\sigma/dt &= -\frac{r_1\sigma N_1}{\gamma_1(b_1 + \sigma)} - \frac{r_2\sigma N_2}{\gamma_2(b_2 + \sigma)} + \beta. \tag{9.20}
\end{align*}

While these equations are similar to the equations for predator-prey relationships (see [2]), the last equation has a term β to indicate that resources are supplied to the system at a constant rate (see [50]). Another assumption is that the waste generated by this process is able to be ejected by the system. The equations show that the system settles in a region determined by the rate at which the resources are consumed by the replication processes, rather than by an externally given carrying capacity. The requirement for a non-periodic stable solution is that all the derivatives become zero. For nearly identical replica variants, dN1/dt and dN2/dt are positive and approach zero together. In this case, the requirements are similar to the single replica case outlined in section 9.7, and again the system settles in a region determined by the rate at which the resources are consumed by the replication processes. However, as outlined in more detail in the on-line supplementary material of reference [48], where say dN1/dt > dN2/dt for realistic starting values of N1 and N2 (because γ1 > γ2 and/or α2 > α1), both subspecies grow initially, but eventually N2 ceases to grow and ultimately decreases as the term c2N1 drives dN2/dt through zero. Once this happens, the population of subsystem N2 goes to zero while N1 trends towards its maximum value. Assuming variant 2 dies out, the limit of N1, i.e. N1∞, is N1∞ = γ1β/α1, indicating that the carrying capacity of the variant dominating the system is K = N1∞. This is the variant with the highest replication efficiency and the lowest decay rate due to second law effects (i.e. the highest γ/α). As this is the variant needing the lowest entropy throughput, resource fitness corresponds to what will be termed “entropic efficiency”. The result can be generalised to more than two variants, as shown by equation II-47 of Eigen [50]. The above equations give essentially the same result as Eigen for the stable long term configuration. However Eigen's approach shows that a closely interrelated quasi-species dominates the mix rather than just one variant. In the Eigen approach φM is set equal to the average growth rate and death rate over all the variants; here φM corresponds to γβ.
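The behaviour described above can be checked numerically. The sketch below integrates Equation 9.20 for two variants; all parameter values are illustrative assumptions (and the availability of `scipy` is assumed), chosen only so that variant 1 has the higher γ/α. Variant 2 is driven towards extinction while N1 approaches γ1β/α1.

```python
from scipy.integrate import solve_ivp

# Illustrative parameters only (assumed for this sketch, not taken from the text).
r1, r2 = 1.0, 1.0        # maximum growth rates
b1, b2 = 5.0, 5.0        # resource levels at which growth is halved
a1, a2 = 0.05, 0.08      # decay rates alpha_1, alpha_2 (variant 2 decays faster)
c1, c2 = 1e-4, 1e-4      # impact of one variant on the other
g1, g2 = 2.0, 2.0        # replicas produced per resource unit, gamma_1, gamma_2
beta = 10.0              # constant inflow of resources

def rhs(t, y):
    N1, N2, sigma = y
    growth1 = r1 * sigma * N1 / (b1 + sigma)
    growth2 = r2 * sigma * N2 / (b2 + sigma)
    dN1 = growth1 - a1 * N1 - c1 * N1 * N2
    dN2 = growth2 - a2 * N2 - c2 * N1 * N2
    dsigma = -growth1 / g1 - growth2 / g2 + beta
    return [dN1, dN2, dsigma]

sol = solve_ivp(rhs, (0.0, 2000.0), [1.0, 1.0, 10.0])
N1_end, N2_end, sigma_end = sol.y[:, -1]
print(f"N1 -> {N1_end:.1f}  (gamma1*beta/alpha1 = {g1 * beta / a1:.1f})")
print(f"N2 -> {N2_end:.4f}  (driven towards zero)")
```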

Page 159: ALGORITHMICINFORMATIONTHEORY ...algoinfo4u.com/AlgoInfo4u.pdf · Nevertheless the book pointed to Kolmogorov’s work on algorithmic complex-ity, probability theory and randomness

9.8. SELECTION PROCESSES TO MAINTAIN A SYSTEM IN A VIABLECONFIGURATION 159

One can argue that this is like a regulation process which selects the most efficient replicating variant to return the system to a state marginally different from the original. This allows the system to survive the threatening disturbance. However, there is a trade-off. While increasing the variety in the replicated structures increases the entropy of the system, bringing it closer to equilibrium, variety stabilises the system against environmental change. For a system of N identical structures, such as a forest of identical trees grown from seed, the algorithmic representation of N in the generating algorithm will be ≈ log2N and the provisional entropy would be something like H(tree) + log2N (see section 9.7). However if there are M variants of the seed, representing the variants in the genetic code, a system of N structures now requires an algorithm that identifies the particular variant for each structure. As there are M^N possible distributions of variants, N log2M replaces log2N in the provisional algorithmic entropy measure (see section 9.7). The provisional entropy then becomes something like H(tree) + N log2M. While diversity increases the provisional algorithmic entropy, if there are few variants this increase will be small relative to the overall generating algorithm. As has been argued, the variation is coded at the genetic level rather than at the structural level, so in the competitive situation for most living systems the selection processes operate at the genetic level. Just as genetic algorithms mimic real world selection processes, the algorithmic approach sees these real world computations as genetic algorithms.
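As a rough illustration of scale (the numbers below are assumptions chosen only to show the orders of magnitude involved, not values from the text), suppose a stand of N = 10^4 trees is generated from seeds that come in M = 4 genetic variants. Then

\begin{align*}
\text{identical replicas:}\quad & H(\mathrm{tree}) + \log_2 N \approx H(\mathrm{tree}) + 13.3 \text{ bits},\\
\text{with variants:}\quad & H(\mathrm{tree}) + N\log_2 M = H(\mathrm{tree}) + 2\times 10^{4} \text{ bits}.
\end{align*}

Provided H(tree), the cost of specifying the generating algorithm itself, is very much larger than N log2M, as would plausibly be the case for a genome-scale description, the entropy cost of the diversity is comparatively small.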

While chance, rather than physical laws, may be the determining factor in many evolutionary selection processes, entropic efficiency may still be important in semi-stable environments. The question is whether the entropically efficient selection processes considered here are widespread. Nevertheless, one can envisage the situation where a process that selects for entropic efficiency might be perceived as a process that selects for behaviour. For example, separated structures that have some movement might be selected to increase movement, enabling the structures to move together or apart to control the overall temperature. From the computational perspective, variants having this movement would be more entropically efficient in a changing environment. Hence what might be seen as adaptation at the biological level can be interpreted as a drive for entropic efficiency at the algorithmic level. Another example might be those bees that have developed the ability to warm their hive by beating their wings when the temperature drops. Presumably, this behaviour arises because regulation has been built into the genetic code's software through selection processes that operate on individual bees. The effect is that the entropic efficiency of the whole system of bees has increased. What might be seen as adaptation, or “survival of the fittest”, at the biological level is thus a drive that uses replication processes, based on natural laws, to regulate the system. In general terms, variation in the core replicated structure can be selected by the environment to provide an autonomous response to external factors and so can be seen as a quasi-regulation process.

However, while the system selects the variant that optimises the entropic efficiency of each replica, the overall entropy exchange with the environment does not decrease, as the number of structures in the system increases to counter any gain at the level of an individual structure. As is discussed later, this is consistent with Schneider and Kay [92], who show that as biological complexity increases, the overall system becomes more efficient in degrading high quality energy.

As noted above, while variation increases the provisional algorithmic entropy, the increase is small relative to the overall generating algorithm when there are few variants. Selection processes that favour one particular variant of the replica reduce the provisional entropy of the system over the short term. Over the long term, further variations may well increase the provisional entropy, providing material for future selection processes to act on.

What does this mean in terms of the regulation equation 9.6? The input string σ includes the nutrients to maintain the replicas, and the waste needs to be ejected, perhaps to a lower temperature sink in the environment. The principles are captured by a set of bacteria in a nutrient mix that exists in a set of viable configurations by accessing the nutrient and ejecting waste heat and materials to the external, cooler environment. Clearly, if the nutrients become unavailable, the temperature of the environment rises, or an antibiotic enters the bacteria's food chain, most bacteria will die and those better adapted will survive and dominate the population. From a computational point of view, certain computations terminate whereas others do not.

Adaptation and interdependence of replicas in an open system

Open systems of replicated structures can become interconnected and interdependent, leading to more efficient resource use. Such interconnections can emerge when the waste string ζ of one system becomes the resource input σ of another. Fewer resources are needed to maintain the combined systems. The predator-prey relationship is one such case, as the stored energy in one system (the prey) is used as the resource input to the other. The predator is just a highly efficient method of degrading the high quality energy embodied in the prey. Alternatively, mutualism can occur when two different species enhance each other's survival.

In general, a structural map of such interconnected and nested structures identifies the information flow diagram and, where repeats of algorithms are identified, these can be mapped as subroutines to show how the overall algorithm can be efficiently organised. Thus when an interconnected structure can be simply mapped, the overall algorithm implementing the flow diagram will be shorter, indicating a more ordered system. As real world computations are parallel processes, the diagram will initially identify these parallel processes, but it can be reconfigured to produce a serial computation process where routines pass information to each other. That such interconnections reduce the algorithmic entropy can be seen by considering two systems A and B. If the computation of B is shorter when it can use information obtained from the computation of A as an input, then

\begin{equation*}
H(A,B) = H(A) + H(B|A^{*}) < H(A) + H(B). \tag{9.21}
\end{equation*}

Because H(A,B) < H(A) + H(B), the order is greater, as for a given set of inputs the discarded entropy in the waste string is greater. This occurs because the process extracts more high grade energy from what would otherwise have been waste, and in so doing increases the entropy of the waste. Fewer resources are needed to maintain the combined systems. Where interdependence occurs, the far-from-equilibrium coupled systems become more viable and move further from equilibrium, as the overall entropy of the resources needed to maintain the amalgamated system is less. As a consequence, in a resource constrained environment, interdependence will emerge in preference to alternatives, as coupled systems are more stable against input variations. In contrast to two independent systems, because the algorithmic specification of the system involves the sharing of information between routines, which are all part of a nested hierarchy or linked horizontally, fewer bits need to enter and therefore to be ejected from the coupled structures. Coupled systems may also share resources, and total heat loss is reduced as replicas cluster. Furthermore, where the overall temperature becomes higher, the system has more capability to do work.
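For a sense of what Equation 9.21 implies, take some illustrative bit counts (assumed purely for this example):

\begin{align*}
H(A) &= 100 \text{ bits}, \quad H(B) = 100 \text{ bits}, \quad H(B|A^{*}) = 60 \text{ bits},\\
H(A,B) &= H(A) + H(B|A^{*}) = 160 \text{ bits} < H(A) + H(B) = 200 \text{ bits}.
\end{align*}

On these assumed figures the coupled pair needs roughly 40 fewer bits to specify, and correspondingly fewer bits need to flow through the combined system to maintain it, than if the two systems were maintained independently.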

A simple example of interdependence is where photons that exit one laser system are used to create a population inversion in another laser system. In this case, less information is lost to the environment by the combined system than when the two lasers operate independently. The sharing of information between routines requires fewer bits to enter and therefore to be ejected from the coupled structures. Where overall resources are constrained, and there is sufficient variety in the second system (e.g. sufficient line width overlap between the lasers), this system can adapt by settling in more restricted areas of its state space. From an algorithmic point of view this creates mutual dependence, as information is shared horizontally between structures at the same level in a hierarchy and vertically between nested structures. A collection of mutually dependent structures amalgamates to form a larger system that is further from equilibrium.

Adami's concept of physical complexity as a measure of the viability of low entropy structures [1] supports this observation. He states: “In this case, however, there are good reasons to assume that, for the most part, co-evolution will aid, rather than hinder, the evolution of (physical) complexity, because co-evolution is a slow rather than drastic environmental change, creating new niches that provide new opportunities for adaptation.” Where such structures are possible, they would seem to be likely. The situation where one set of constituents enhances the survival of another can be captured by Equation 9.20 by making c1 negative and c2 zero. With appropriate values of the parameters the coupled systems can settle in a stable population regime.


Observationally it appears that, in comparison to independent structures, a system of nested subsystems (such as a forest ecology nesting species, with an individual of a species nesting organs, and organs nesting cells) is more viable far-from-equilibrium, as the waste of one part of the system is the resource input for another. At the very smallest scale of an ecology, the substructures consist of variations of a basic unit such as cells, where variations occur when genetic switches express certain genes in preference to others. The algorithmic equivalent of these switches corresponds to information passed from a higher routine to the generating subroutine. This information may come from a neighbouring cell, as is the case when a particular set of cells produces an organ such as an eye. Alternatively the information may be an input from the external environment. Nesting also allows specialisation of function, with little increase in algorithmic entropy, particularly if the output of a nested algorithm depends only on the information inputs. This allows the pattern of variations to emerge as the structure evolves by a generating algorithm similar to Equation 6.2. As was mentioned in section 6.3, the algorithm that describes the trajectory that generates the structure, such as the algorithm embodied in the seed of a tree, is likely to be shorter than one that describes the structure in its final form, unless significant information can be ignored. As argued above, overall there is less need for waste to be ejected per nested unit because of the mutual dependencies, and the efficiency in ejecting disorder (such as heat) is higher.

Variety in nested systems and degree of organisation

The nesting of structures is a particular form of interdependence that increases the overall order. The nested structures can be mapped algorithmically to a nested set of subroutines that pass information to each other, as indicated by Equation 9.21. The nested algorithms not only form a short description but, because they pass information to each other, eject less waste information per structural unit. The idea can be illustrated by imagining a tree that we could only observe at the smallest level of scale, not recognising any order at a larger scale. As order is not perceived at the smallest scale, the tree can only be described by a lengthy algorithm that specifies the structure at the molecular level and how the molecular structures combine to form the tree. However if we perceive the tree at a larger scale we might see it as a collection of cells. That would allow a shorter description, as the cell algorithm only needs to be specified once. At the largest scale, denoted by d0, the overall structure and interconnections would be recognised, allowing a much shorter description of the tree to be found. Chaitin's [26] concept of ‘d-diameter complexity’, which quantifies order at different levels of scale, is helpful in understanding nested structure and the entropy cost of introducing variety. Figure 9.1 illustrates a nested system of replicated structures (see Devine [47]). Here the smallest scale is denoted by dn, where n is large and dn is small. At this scale, as no pattern is discernible, the system can only be described by an algorithm that specifies what appears to be a random arrangement of fundamental structures, indistinguishable from a local equilibrium macrostate.



Figure 9.1: Nested structures at scales d0, d1, d2 and d3.

The provisional algorithm specifying the arrangement would be extremely long and the algorithmic entropy would be indistinguishable from that of a local equilibrium macrostate. However, observing the system at a higher level of scale allows some structures, such as molecules, to be discerned, and the algorithm specifying the system becomes shorter. At an even higher level of scale, cells might be recognised. In that case, once the cell is defined, the overall algorithm becomes much shorter, decreasing the algorithmic entropy. This is because the structure can be more simply built by repeating the cellular building blocks.

Using this approach, Chaitin [26] quantifies the degree of organization (Dorg) of structure X at the scale dn as the distance the system is from a local equilibrium where there is no recognisable pattern. The degree of organization is a self-delimiting variant of Kolmogorov's “deficiency in randomness”. The degree of organisation Dorg at scale dn with self-delimiting coding is given by:

\begin{equation*}
D_{org}(d_n) = H_{max}(X) - H_{d_n}(X).
\end{equation*}

Hmax corresponds to the logarithm of the number of possible states when no structure can be discerned. As more structure is identified with increasing scale, the length of the generating algorithm shortens, finally reaching a minimum at the largest scale, which identifies the maximum value of Dorg.
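The trend can be made concrete with placeholder numbers. In the sketch below the description lengths assigned to each scale are assumptions chosen only to illustrate how Dorg grows as more structure is recognised; they are not derived from any real system.

```python
# Illustrative description lengths (bits) of the same system at different scales.
# All values are assumed placeholders, chosen only to show the trend.
H = {
    "d_n (no structure, ~local equilibrium)": 1.0e8,  # every unit specified separately
    "d_2 (molecules recognised)":             5.0e6,
    "d_1 (cells recognised)":                 1.0e5,  # one cell described, then repeated
    "d_0 (whole structure recognised)":       1.0e4,  # shortest description, largest scale
}

H_max = H["d_n (no structure, ~local equilibrium)"]

# Degree of organisation at each scale: distance from the no-structure description.
for scale, h in H.items():
    print(f"{scale:40s} H = {h:>12,.0f} bits   D_org = {H_max - h:>12,.0f} bits")
```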

Figure 9.2 illustrates how the algorithmic entropy varies with the scale parameter d [47]. In the figure, d2, d1 and d0 denote the scale dimensions from figure 9.1, with dn << d2 < d1 < d0.

Let Hd0 represent the minimum value of the algorithmic entropy of the system, based on its shortest description at the largest scale d0 [46]. At this level Hd0 << Hdn, as the algorithmic entropy reaches its minimum when the nested structures that emerge can be described by interconnected subroutines.


Figure 9.2: Variation of d-diameter complexity with scale; — nested replicators with no variation; – – – nested replicators with variation; - - - no organization at any scale [47].

In figure 9.2, the stepped bold line shows an ideal system where all the nested systems at each level of scale are identical replicated structures. The entropy remains about the same until a lower level of scale suppresses structure. As the scale parameter d decreases from d0 to dn, the entropy increases in well-defined steps, until at the smallest scale, when large scale structure is suppressed, the system is perceived as a collection of molecules. As all structure is gone, Hmax = H(sequil). In this case Dorg becomes a measure of the degree of order, i.e. Dorg = Dord = Hequil − Hd0.

However, where variations in the replicated structures occur, as happens with different cell types at the same scale, the steps are smoothed, as shown by the dashed line. Diversity increases the provisional entropy at each level of scale. When no organization at all exists, the scale is irrelevant, and the algorithmic entropy is the same at all levels of scale, as shown by the dotted horizontal line Hmax in figure 9.2.

The implication is that where an algorithm, such as that embodied in DNA, contains subroutines that can be switched depending on a particular input, diversity increases with only a small increase in the algorithmic entropy. As software variation occurs at lower levels of scale, it would appear to be algorithmically more efficient to generate the variation through the software (e.g. variation in DNA) rather than by independently specifying variants of similar structures. It is plausible to conjecture that the nesting of structures decreases the entropy in a way that more than compensates for the entropy cost of building diversity into the replicated structures [47]. Indeed, this may be an inevitable consequence of selection processes acting on structures. Identical replicated structures with high Dorg probably have less ability to adapt to change than systems with variation or greater diversity. The cost of allowing variation is a lower Dorg.

Mutualism and life

As selection processes favour entropically efficient replicating structures, the reduced dependence on external resources means the carrying capacity of interdependent nested systems is higher than that of systems where interdependence does not exist. As the coupled systems co-evolve by using resources more efficiently, in a resource constrained environment the coupled systems are more stable against input variations. These factors suggest that such a system searches out entropically efficient configurations to maintain the survival of replicated structures. Where interdependence increases the entropic efficiency of the systems under resource constraints, interdependence is more likely to eventuate. The mutual attractor region of the coupled systems will not drift through state space at the same rate as that of similar, but uncoupled, systems. Where such structures are possible, they would seem to be likely. This tends to build interdependence, as mutualism, or interdependence between systems of replicas, increases the order, driving the organised system further from equilibrium.

This argument has been applied to coupled biological systems [48], suggesting that life on earth is more efficient in maintaining the system further from equilibrium than would be the case without life, as life creates its own autonomous capability to survive and adapt. The caveat is that a concentrated source of energy, such as photons from the sun, must be accessible, and waste must be able to be ejected. However there is a cost, as life drives the whole system further from equilibrium at the expense of being a more efficient dissipator of energy. As observed earlier, in comparison to simpler structures, a system of interconnected and nested systems (such as a forest ecology nesting species, with species nesting organs, and organs nesting cells) is more viable far-from-equilibrium, as the waste of one part of the system is the resource input for another.

While overall less waste per unit structure is ejected, Schneider and Kay [92] argue that a system of hierarchical structures, such as a complex ecology, is more effective in degrading high quality energy than simpler or inert structures and pumps out more useless energy at a greater rate. Hierarchical and interconnected systems make better use of the resources, but at the expense of driving their universe more quickly to equilibrium, as high quality energy is processed more efficiently. At each step in an interconnected hierarchy, waste ejected by one system can be degraded by a lower system in the hierarchy. This happens when insects, fungi or bacteria degrade plant material far more rapidly than would happen if driven by non-living processes. However, while resource use is more efficient in nested systems, as the population of the coupled structures increases, the overall degradation of the input resources is both more rapid and more complete than would otherwise be the case. More resources are consumed as the population of different replicated structures grows to more efficiently process high quality energy. In a nutshell, individual replicated structures seek entropic efficiency, but at the system level the population rises to increase overall dissipation. The more coupled the system, the greater the structure, but, by the above arguments, the faster and more complete the overall degradation of the input resources. One can also envisage a firm, a city or a national economy as a connection of interdependent substructures, analogous to Beer's Viable Systems Model [7, 8]. In contrast to the neoclassical concept of equilibrium, such interconnected systems are far from thermodynamic equilibrium. From an algorithmic point of view, one can see the various structures and nested substructures within this system as computational units, functioning under a high level computer language that captures their relevant behaviours. The high level language can in principle be expressed in binary terms if needed. However, from the previous arguments in Sections 5.2 and 9.4, the algorithmic entropy of a complex nested set of subsystems, maintained far from equilibrium by the flow through of information, can be determined from the entropy or information flows into or out of the system up to a given time. All that one needs to know is that in principle the detailed algorithmic behaviour can be captured, and then the algorithmic entropy determined from the net entropy flow. As internal information flows cancel, the overall algorithmic entropy is determined from the net information flow into the system. From Landauer's principle, this entropy flow completely aligns with traditional entropy flows, but with the additional argument that the net entropy remaining in the non-equilibrium system has real meaning, as it is derived from the algorithmic entropy. One can then argue that the sophistication of a far from equilibrium system, such as an economy or an ecology, is its degree of order Dord, as measured by its distance from equilibrium, since this is a measure of how effectively the system uses the entropy or information throughput (section 6.2).

9.9 Summary of system regulation and AIT

Algorithmic Information Theory identifies the algorithmic entropy, or the information content of a system, as the minimum number of bits required to describe the system on a UTM, while Landauer's principle [74] holds that the erasure of one bit of information corresponds to an entropy transfer of kB ln2 to the environment. Together these understandings provide a consistent way of looking at the thermodynamic cost of maintaining a system in a stable set of configurations distant from equilibrium [114]. As only entropy differences have physical significance, the number of bits needed to shift a system from one configuration to another, multiplied by kB ln2, corresponds to the equivalent thermodynamic entropy change of the system. In thermodynamic terms, where a disturbance or a second law degradation process threatens the viability of a system by increasing the algorithmic entropy by H bits, or equivalently the thermodynamic entropy by kBH ln2 Joules per degree Kelvin, the thermodynamic cost of restoring the system is to compensate for the change by removing kBH ln2 Joules per degree Kelvin.
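A minimal sketch of the arithmetic, with an assumed temperature and an assumed bit count, is given below; it simply applies Landauer's principle to convert a change of H bits into the corresponding thermodynamic entropy and the minimum heat that must be ejected.

```python
import math

k_B = 1.380649e-23  # Boltzmann constant in J/K

def landauer_cost(bits, temperature=300.0):
    """Entropy transfer and minimum heat ejected when `bits` of information
    are erased at the given temperature (Kelvin), per Landauer's principle."""
    dS = k_B * bits * math.log(2)   # thermodynamic entropy, J/K
    Q = temperature * dS            # minimum heat ejected, Joules
    return dS, Q

# Illustrative figure only: a disturbance that adds 10**9 bits of algorithmic entropy.
dS, Q = landauer_cost(1e9, temperature=300.0)
print(f"entropy to remove: {dS:.3e} J/K; minimum heat ejected at 300 K: {Q:.3e} J")
```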


Furthermore, it is argued that, as a system of replicated structures can be described by few bits, natural replication processes use all the available resources to generate simple ordered structures with low algorithmic entropy. This perspective allows one to recognise that a natural replication process can be envisaged as a computation on a real world UTM that is able to generate order far from equilibrium. Not only can replicating processes generate order, these processes can self-regulate or self-organise, as they can naturally compensate for second law degradation by accessing external resources as inputs to regenerate degraded replicated units while ejecting entropy as waste, fulfilling the requirements of Landauer's principle.

When there is a changing environment and there is diversity among replicating units, selection processes act as an autonomous quasi-regulation process. The replica variant that persists is likely to be the one that is more entropically efficient in terms of its use of resources and its capability to eject waste. In these cases, environmental fitness carries with it a requirement to optimise the entropic efficiency of the structures in the system. Entropic efficiency means that the entropy cost of maintaining the system is minimised, and the system adapts by settling in a far-from-equilibrium configuration that is as close to reversible as possible. However, while selection processes do allow a system to return to something close to the original configuration, over time the system drifts or evolves to new configurations as new variations appear.

Similarly, where interdependence between different replicated structures is possible, simple calculations suggest that selection processes will favour the variants that exhibit interdependence. As interdependence is more entropically efficient at the structural level, the number of interdependent structures increases. Hierarchies of replicated structures would seem likely to emerge as selection processes favour those that waste fewer resources. But as the numbers of such structures increase, the overall energy dissipation is higher and, while the hierarchical system persists, it hastens the degradation of the high quality concentrated energy that feeds the living process. As Schneider and Kay show [92], an ecology consists of myriads of self-replicating efficient units. But, because species lower in the food chain consume the waste of higher species, the overall degradation of the system is greater than would be the case for a non-living system, or even a simpler living system. Living systems hasten the death of the universe.

Where a system needs to do work, it must be maintained further from equilibrium, and again replicating processes can maintain the system in a suitable stable configuration.

The results are general and can in principle apply to any system. However, in many real world situations there will be insufficient understanding of the details of the system to quantify the cost of maintaining the system far from equilibrium. Nevertheless, the approach may well be useful in understanding incremental changes to real systems and in providing broad descriptions of some system behaviour. While there is more to a living system than replication or entropy flows, just as energy constraints determine system behaviour, so entropy requirements constrain structural possibilities.


Chapter 10

AIT and Philosophical issues

10.1 Algorithmic Descriptions, Learning and Artificial Intelligence

Algorithmic Information Theory provides some practical insights into learning and artificial intelligence. One can take an intuitive view of learning as a process by which an agent, when given data associated with some phenomenon, can make sufficient sense of the data to partially predict future outcomes. Learning is impossible where the observations show no pattern or structure; as such observations cannot be compressed, each future outcome is a surprise. Leibniz [35] has argued that the best theory compresses data more effectively than alternatives. As a consequence, the better the theory, the better the learning that takes place and the better it can predict future outcomes. As the fields of learning and artificial intelligence are specialist disciplines, only the flavour of the algorithmic insights can be outlined here. Hopefully those who are interested can delve deeper into the topic.

Two lectures by Solomonoff [99, 100] review the connection between compression, learning and intelligence. Learning can be interpreted within an AIT framework. In particular, the outcome of a learning process is as effective as the Minimum Description Length procedure. As has been outlined in section 7, the Minimum Description Length (MDL) approach argues that the best fit to observed data is found by the physical model that provides the most compressed version of the data. It was pointed out that the approach satisfies both the simplicity embodied in Occam's razor and the breadth of Epicurus' principle, as the approach seeks the simplest explanation without rejecting possible causes. However there are some deeper questions surrounding MDL. Vitanyi and Li [109] have pointed out that data compression using MDL is almost always the best strategy for both model selection and prediction, provided that the deviation from the model is random with respect to the envisaged probability hypothesis. What this means is that, to be effective, there can be no identifiable order in the deviations. As was mentioned in section 7, the ideal approach to MDL can be derived from a Bayesian approach using the algorithmic universal distribution of section 3.8 as the prior distribution. With this universal probability, the MDL approach minimises the sum of two terms. The first term characterises the model by the universal distribution, which is effectively the algorithmic description of the model, and the second codes the deviations from the model by a code with length equal to the negative logarithm of the probability of that deviation; i.e. a code that satisfies Shannon's noiseless coding theorem of section 3.3. Barron and Cover [4] apply the approach to selecting the most appropriate probability distribution to explain a set of random variables drawn from an unknown probability distribution. They show that the two part MDL approach will eventually find an optimum data distribution from the class of candidate distributions. However the approach is based on having an effective strategy to find potential models, rather than on any belief about the simplicity of the true state of the world. In contrast to the original Bayesian approach, MDL makes no assumptions about the underlying distribution, nor any subjective assessments of the probabilities. As Grunwald points out [62], MDL is asymptotically consistent without any metaphysical assumptions about the true state of things. Furthermore, the practical MDL approach is an effective method of inductive inference with few exceptions.
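A crude two-part MDL sketch is given below. It is not the ideal algorithmic-probability construction described above; it simply compares, for a binary sequence, a parameter-free fair-coin model with a one-parameter Bernoulli model whose parameter is encoded to the usual (1/2) log2 n precision. The function names and the example sequences are invented for illustration.

```python
import math

def h2(p):
    """Binary Shannon entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def two_part_mdl(bits):
    """Pick the model with the smaller total (model + data) code length."""
    n = len(bits)
    p_hat = sum(bits) / n
    cost_fair = n * 1.0                               # no parameters; 1 bit per symbol
    cost_bern = 0.5 * math.log2(n) + n * h2(p_hat)    # parameter bits + Shannon-coded data
    return ("Bernoulli(p)", cost_bern) if cost_bern < cost_fair else ("fair coin", cost_fair)

biased = [1] * 80 + [0] * 20      # compressible: the extra parameter pays for itself
balanced = [0, 1] * 50            # incompressible: the parameter is not worth its cost
print(two_part_mdl(biased))       # expect the Bernoulli model
print(two_part_mdl(balanced))     # expect the fair-coin model
```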

Vereshchagin and Vitanyi [108] show that the structure function determines all stochastic properties of the data: for every constrained model class it determines the individual best fitting model in the class, irrespective of whether the “true” model is in the model class. In this setting, this happens with certainty, rather than with high probability as in the classical case. It becomes possible to precisely quantify the goodness-of-fit of an individual model with respect to individual sets of data.

Hutter [65, 66, 68] combines the idea of universal sequence prediction with the decision theoretic agent by a generalised universal semi-measure. Induction is optimised when the agent does not influence the outcome, whereas combining this with decision theory allows the agent to interact with the world. The model he proposes is not computable, but computable modifications of it are more intelligent than any time or space bound algorithm.

10.2 The mathematical implications of Algorithmic Information Theory

Chaitin [36] argues that, as there are mathematical statements that are true but not provable, mathematics should be taken to be quasi-empirical, rather than following the usual a priori approach. This would allow new axioms to be added to the formal mathematical structure where there is some evidence to support either the truth or the falsity of otherwise unprovable statements. Chaitin [34] argues that Einstein saw mathematics as empirical, even suggesting that the integers are a product of the human mind. Chaitin also argues that, while Godel, with his Platonic world view, saw mathematics as an a priori activity, Godel does state that: “If mathematics describes an objective world, there is no reason why inductive methods should not be applied the same as in physics”.


This leads Godel to argue that a probable decision about the truth of a new axiom is possible inductively. Where such a possible axiom has an abundance of verifiable consequences and powerful impacts, it could be accepted as an axiom in the same sense that a hypothesis is accepted as a physical theory. The implication is that mathematics, by judiciously selecting axioms that appear to be true for good empirical reasons, but which are not yet provable, can decide “truth” empirically.

10.3 How can we understand the universe?

Those researchers who see the evolving path of the universe as a computational process raise the possibility that the universe is itself a digital computer. The first to do so was Leibniz who, according to Chaitin [35], believed in digital physics. More recently such notables as John Wheeler (quoted in [27]) coined the phrase ‘It from bit’, while Ed Fredkin [55, 54] sees the fundamental particles of the universe as processors of bits of information. At the Planck scale of 1.6 × 10^-35 m, quantum effects dominate, and the universe would seem to behave like a quantum computer almost by definition.

Seth Lloyd [82] (see also [81]) explores the idea of the universe as a computer, as it processes information in a systematic fashion. He suggests the universe may perform quantum computations where every degree of freedom registers information and the dynamical interactions between degrees of freedom process information. Lloyd puts bounds on the processing capability of the universe. If gravitation is taken into account, the total number of bits is ≈ 10^120 and the number of operations is ≈ 10^120; both are simple polynomial functions of the constants of nature h, c, G and the age of the universe. The number of operations within a time horizon t involves the Planck time, the scale at which gravitational effects are of the same order as quantum effects. Lloyd looks at a matter dominated universe and a radiation dominated universe, and finds that the formulae describing the number of operations and the number of bits are about the same in both cases. It is not the purpose of this work to delve deeply into the points Lloyd makes, but to point readers to the possible ramifications of seeing the universe in such a way.

Even if the universe is a quantum computer, the strong Church-Turing thesis implies that, in circumstances where there are no infinities in the universe, every finitely realisable system can be simulated by a finite TM [44]. However there are arguments against a digital physics based on continuous symmetries and the implication of hidden variables [49].

Even so, it would seem reasonable to assume that above the Planck scale digital computations may effectively describe physical processes. The internal states of the computer can be considered discrete to within quantum limits. Similarly there is little difficulty in envisaging the universe as a computer that steps through a trajectory in its reasonably well defined state space. For example, the behaviour of say a gas in a container can be seen as a computing process when collisions move the system from one microstate to another.

Paul Davies in The Fifth Miracle (page 95, [43]) states that the universe has the same logical structure as a computer. He argues from Algorithmic Information Theory that physical laws show pattern, and therefore are information poor in the sense that the laws are algorithmically simple. The data embodied in the laws is assumed to be compressible. On the other hand, Davies suggests the genome is information rich and close to random. (It should be noted that algorithms exist to compress DNA for information storage purposes; so while the genome is certainly information rich, it is not necessarily close to random.) Davies then argues that the genome is highly specific because it arose from evolutionary processes through random mutations and natural selection. This leads to a compressed, but specific, information structure. Personally I think that replication processes, as I argue in section 9, are a natural way to generate such ordered structures.

Wolfram [112] puts forward a stronger view and argues that the order we see in the universe arises through simple computational processes analogous to those undertaken by cellular automata. If so, the universe is not random but pseudo-random, as the complex structures that emerge do so through relatively simple computational processes. Wolfram then surveys simple computational worlds hoping to find insights into the computational processes of the universe itself, and claims a Theory of Everything (TOE) is in principle possible by trialling different simple computational machines. He believes that interpreting the universe in digital form is likely to provide better insights than the more conventional mathematics based on continuum dynamics. Interestingly, in chapter 12 of his book, Wolfram [112] provides a detailed account of Algorithmic Information Theory.

Another approach is that of Chaitin [27], who discusses the intelligibility of the universe in terms of AIT. He quotes Weyl, who drew attention to Leibniz's claim that physical laws must be simple. Chaitin supports Weyl's belief that simplicity is central to the epistemology of science and gives examples from a number of eminent scientists about simplicity and complexity. These scientists support the view that a belief that the universe is rational and lawful is of no value if the laws are beyond our comprehension. Is then a TOE possible? Such a TOE would maximally compress the observable universe. Chaitin quotes Barrow [27] in arguing that, even if a TOE were uncovered, there would be no certainty that a deeper explanation does not exist. This possibility appears to contradict Wolfram's view outlined above that a simple generator of physical reality might be possible. However there is no contradiction, as Chaitin points out. The TOE he and Barrow are describing is an algorithmic description, effectively limited by Godel's unprovability, whereas Wolfram suggests systematically searching for simple generators of the universe, starting with the most simple.

Calude and Meyerstein [19] raise the question as to whether the universe is lawful or not. Their conclusion is that the universe is lawless in the sense that there is no overall structure implied by the word “law”. Their argument in effect says that the human mind is not capable of grasping the behaviour of the universe. While these authors refer first to the historical perspectives from Plato to Galileo, their own argument focuses on the strong articulation of Godel's theorem showing that the set of unprovable but true statements is large: unprovability is everywhere. The arguments are interesting and capture Chaitin's statement “that god not only plays with dice in quantum mechanics, but even with the whole numbers”. Calude and Meyerstein [19] suggest that the universe may best be described by what they call a lexicon, produced by say the toss of a coin. Within this infinite lexicon, every possible sequence will occur. Sections of this string representing our universe may be partially ordered, and we can hope that this ordered part of the lexicon may be partially explained by science. But elsewhere the string may be random. Where we do make sense of things, it will only be for that part of the lexicon that is ordered. Calude and Meyerstein argue that, as the human mind can never fully understand mathematics, the human mind can never fully understand the physical world. We are part of the system and we do not have the energy resources needed to probe the relativistic universe. On the other hand, Casti and Karlqvist [22] hold the hope that, while the mathematical universe might be too algorithmically complex for the human mind, in principle the physical universe might not be.

Why can we make any sense of the universe?

We do seem to be able to make sense of the universe, but our capability to do so would seem to be limited for three reasons.

• The storage capability of the human mind is limited. Even if this is extended through external storage and processing of information (such as was done in the proof of the four colour theorem), there will still be finite limits to the amount of information humans can process. If computers do the processing for us, will we have the same confidence in the result if our minds are unable to process the information directly?

• Even if humans access adequate storage, they might have insufficient time to undertake the computing steps needed to come up with an adequate understanding of, say, a law of the universe. This is analogous to the fact that it may take longer to calculate the weather than the universe takes to generate the observed weather.

• Finally the Godel limit may well constrain human sensemaking. Even if the other problems did not limit human rationality, the above arguments, implicit in Godel's theorems, indicate that most of the statements that characterise the behaviour of the universe might well be unprovable to us, either as individuals or as a collection of humans. In effect, if scientific understandings are based on formal logical processes, most theorems will be beyond us.

The question then arises as to why we are able to make any sense of anything. As Einstein, quoted by Chaitin [27], said: “The most incomprehensible thing about the universe is that it is comprehensible”. It could be that sense making through physical laws is not really rigorous in the formal system sense and therefore is not limited by Godel's theorem. We may have explanations that work, but which might at some level not require consistency, but rather a lesser form of certainty, perhaps embodied in a probability. It could be that our brains are quantum computers and are not constrained by Godel's theorem. However the following argument suggests that reductionism, which is core to much of our sense making, works because the observable universe is relatively simple.

The scientific reductionist approach seems to be extraordinarily successful in making sense of the universe, despite the attacks of some philosophers and postmodernists. That it should be so successful is itself a surprise.

Yet we do seem to be able to make sense of the world by reducing the whole, which appears impenetrable, to smaller manageable components. Surely this is because the world is structured that way. In other words, the surprise is not that the universe is complex, but that it is algorithmically simple. This is a simplicity that allows the observed universe to a large extent to be reduced to simple laws. These laws are so simple that they can be processed by a finite brain that is insignificant relative to the rest of the universe. Such simplicity seems to arise because small parts of the universe can be described simply. I can describe algorithmically the behaviour of a falling stone; a stone which is part of a much greater whole. My algorithmic description of a falling stone can be seen as a subroutine in a much more complex algorithm. But to a large extent, the inputs of the rest of the universe make little difference to the result; the rest of the universe impacts on the falling stone mostly as small perturbations. It would seem that it is because the universe we observe consists of structures nested within structures, or order nested within order, that we can make sense of it. Each level of nesting can be mapped on to a subroutine that describes the structure at that level. At our level of inquiry, many of the subroutines describe highly ordered systems with low algorithmic complexity. This suggests that we can make sense at a particular level of nesting because the algorithmic complexity of the system is less than the algorithmic complexity of our cognitive processes. While we may not have the computational capacity to understand the system in its entirety, we appear to have the capacity to understand the much less algorithmically complex subroutines that are nested within the system, and can model the manner in which these subroutines are integrated into the whole. Such an approach may not always work, but in practice it seems to work surprisingly often. However, in reconstituting the subroutines we often need to allow for apparently random contributions; i.e. contributions that are outside our capability to comprehend. External impacts may often appear as randomness. Whether hidden laws, noise, or quantum uncertainty, what seem to us like chance events in the macro world often only have a minor impact on the part of the universe we can make sense of.
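
The point can be made concrete with a toy sketch (in Python; the function names, the numbers and the form of the perturbation are purely illustrative assumptions, not part of the argument above). The falling stone is a low-complexity subroutine nested inside a far more complex whole, and the rest of the universe enters only as a small perturbation:

import math

def falling_stone(t, g=9.8):
    # The short, low-complexity law s = g*t**2/2 describing the fall.
    return 0.5 * g * t ** 2

def rest_of_universe(t):
    # Everything else impinging on the stone, modelled here as a tiny
    # perturbation; the sinusoidal form is an arbitrary illustrative choice.
    return 1.0e-6 * math.sin(t)

def observed_fall(t):
    # The nested low-complexity subroutine dominates; the rest of the
    # universe contributes only a small correction, often treated as noise.
    return falling_stone(t) + rest_of_universe(t)

print(observed_fall(2.0))  # about 19.6 metres; the perturbation is negligible

The subroutine falling_stone is vastly shorter than any description of the whole system, yet at this level of nesting it accounts for almost all of the observed behaviour.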

Perhaps we rely on phenomenological ways of sensemaking that avoid the Gödel dilemma. The implicit or explicit axioms of phenomenological laws may have empirically selected one of two possible truth statements, or even selected a path through a binary tree of unprovable truth statements. This may allow us to describe highly algorithmically organised systems at the macro level to
a moderate degree of satisfaction. Nevertheless, it may be impossible to relate these higher-level understandings to more fundamental understandings because of our cognitive limitations. When the whole cannot be explained by combining the parts, it might be because our brains cannot operate outside strict formal logical processes.

It could even be that the uncertainty principle arises through such a process. While the great physicists and mathematicians, including Gödel, have been sceptical about any relationship between his theorem and the uncertainty principle, and while the formal framework of quantum physics does not allow for hidden variables, who knows?


Appendix A

Exact cost of disturbance

Zurek [114] has shown that H(i|o) bits are required to restore a system from the output state o to the input state i when information has been lost in going from i to o. In this case, H(o) < H(i) + H(o|i), as bits are removed. However, in general, as H(i|o) ≠ H(o|i), H(i|o) does not match the information lost in going from i to o unless the computation is reversible. As a consequence, there does not appear to be a simple relationship between the entropy impact of the disturbance and the entropy of the restoration process. Zurek has argued that a slightly weaker requirement, one that restores the system from o∗, the algorithmically compressed description of o, rather than from o itself, avoids some limitations of the strong requirement. This weaker version identifies the cost in bits of restoring i from o as not less than

H(i|o∗) = |p∗_reg| = H(i|o, H(o)) = H(i) − H(o),        (A.1)

where the restoration programme is given by p∗_reg. As knowing o∗ is equivalent to knowing both o and H(o), the weaker version requires fewer bits, as the necessary information is captured in H(o).

But

U(p∗_reg, oH(o)) = U(p∗_reg H(o), o) = i,

as H(o) can be part of the programme string rather than the input string. This implies that H(i|o) = H(i|o∗) + |H(o)|, or H(i|o) = H(i|o∗) + log_2 H(o) (see also Zurek [114]).
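
As a toy illustration of this bit accounting (my own example, not Zurek's, with all machine-dependent O(1) terms ignored): suppose i is a typical random string of 1000 bits and the computation erases its last 400 bits, so that o is the remaining 600-bit prefix. Then H(i) ≈ 1000 and H(o) ≈ 600, and the weaker restoration cost is H(i|o∗) = H(i) − H(o) ≈ 400 bits, which is exactly the 400 erased bits that must be supplied again. Restoring i from o itself, rather than from o∗, requires in addition roughly |H(o)| ≈ log_2 600 ≈ 10 bits to specify H(o), so that H(i|o) ≈ 410 bits.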

The weaker version will be used to characterise the disturbance and any regulatory response, for the following reasons.

• As the stronger version of the regulatory response must replace the lost history in generating o, the regulatory response in the stronger version is not as simply related to the algorithmic entropy introduced by the disturbance.

• The weaker version is antisymmetric, in that the number of bits introduced into the system in going from o to i, namely H(i|o∗), equals the number of bits removed
in going from i to o. The entropy cost of restoring the system after a disturbance is the same as the entropy change to the system, but the direction of flow is opposite. A corollary is that the net entropy flow around a closed cycle is zero, as the short bit accounting after this list makes explicit.

• The entropy difference H(i) − H(o) is independent of the UTM and is path independent.
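
To make the antisymmetry and the closed-cycle corollary explicit (a restatement of the points above in equation form, again ignoring O(1) terms): the disturbance i → o removes H(i) − H(o) bits from the system, while the restoration o → i introduces H(i|o∗) = H(i) − H(o) bits, so that around the cycle i → o → i the net flow is (H(i) − H(o)) − (H(i) − H(o)) = 0. Because only the difference H(i) − H(o) appears, the accounting does not depend on the particular UTM or on the path by which the restoration is carried out.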


Bibliography

[1] C. Adami. What is complexity? BioEssays, 24:1085–1094, 2002.

[2] J. Alebraheem and Y. Abu-Hasan. Persistence of predators in a two predators-one prey model with non-periodic solution. Applied Mathematical Sciences, 6(19):943–956, 2012.

[3] W. R. Ashby. Introduction to Cybernetics. University Paperbacks, London, 1964.

[4] A. Barron and T. Cover. Minimum complexity density estimation. IEEE Trans. on Information Theory, 37(4):1034–1054, 1991.

[5] A. R. Barron, J. Rissanen, and B. Yu. The MDL principle in modeling and coding. IEEE Trans. on Information Theory, IT-44(6):2743–2760, 1998.

[6] J. D. Barrow. Impossibility: The limits of science and the science of limits. Random House, London, ISBN 0-09-0772116, 1999.

[7] S. Beer. The Heart of Enterprise. Wiley, Chichester, 1979.

[8] S. Beer. Brain of the Firm, 2nd ed. John Wiley & Sons, Chichester, 1981.

[9] C. H. Bennett. Logical reversibility of computation. IBM Journal of Research and Development, 17:525–532, 1973.

[10] C. H. Bennett. Thermodynamics of computation - a review. International Journal of Theoretical Physics, 21(12):905–940, 1982.

[11] C. H. Bennett. Logical depth and physical complexity. In R. Herken, editor, The Universal Turing Machine - a Half-Century Survey, pages 227–257. Oxford University Press, Oxford, 1988.

[12] C. H. Bennett. Notes on Landauer's principle, reversible computation, and Maxwell's demon. http://xxx.lanl.gov/PS_cache/physics/pdf/0210/0210005.pdf, 2003.

[13] C. H. Bennett, P. Gacs, M. Li, P. M. B. Vitanyi, and W. H. Zurek. Information distance. IEEE Transactions on Information Theory, 44(4):1407–1423, 1998.


[14] L. Brillouin. Science and Information Theory. Academic Press, New York, 2nd edition, 1962.

[15] H. Buhrman, J. Tromp, and P. Vitanyi. Time and space bounds for reversible simulation. Journal of Physics A: Mathematical and General, 34(35):6821–6830, 2001.

[16] C. Calude. Information and Randomness: An Algorithmic Perspective. Springer-Verlag, 2nd edition, 2002.

[17] C. Calude, D. I. Campbell, K. Svozil, and D. Stefanescu. Strong determinism vs. computability. In W. Depauli-Schimanovich, E. Koehler, and F. Stadler, editors, The Foundational Debate, Complexity and Constructivity in Mathematics and Physics, pages 115–131. Kluwer, Dordrecht, Boston, London, 1995.

[18] C. Calude, M. J. Dinneen, and C. K. Shu. Computing a glimpse of randomness. Experimental Mathematics, 11(3):361–370, 2002.

[19] C. S. Calude and F. W. Meyerstein. Is the universe lawful? Chaos, Solitons & Fractals, 10(6):1075–1084, 1999.

[20] J. Casti. The great Ashby: complexity, variety, and information. Complexity, 2(1):7–9, 1996.

[21] J. L. Casti. Complexification: Explaining a Paradoxical World Through the Science of Surprise. HarperCollins, 1995.

[22] J. L. Casti and A. Karlqvist, editors. Boundaries and Barriers. Addison-Wesley, New York, 1996.

[23] J. L. Casti and C. Calude. Randomness: the jumble cruncher. New Scientist, 183(2466):36–37, 2004.

[24] G. Chaitin. On the length of programs for computing finite binary sequences. J. ACM, 13:547–569, 1966.

[25] G. Chaitin. A theory of program size formally identical to information theory. Journal of the ACM, 22:329–340, 1975.

[26] G. Chaitin. Toward a mathematical definition of "life". In R. D. Levine and M. Tribus, editors, The Maximum Entropy Formalism, pages 477–498. MIT Press, Massachusetts, 1979.

[27] G. Chaitin. On the intelligibility of the universe and the notions of simplicity, complexity and irreducibility. In W. Hogrebe and J. Bromand, editors, Grenzen und Grenzüberschreitungen, XIX. Deutscher Kongress für Philosophie, Bonn, September 2002, pages 517–534. Akademie Verlag, Berlin, 2004.


[28] G. J. Chaitin. Information-theoretic limitations of formal systems. J. ACM, 21(3), 1974.

[29] G. J. Chaitin. Algorithmic information theory. In Encyclopedia of Statistical Sciences, volume 1, pages 38–41. Wiley, New York, 1982.

[30] G. J. Chaitin. Gödel's theorem and information. International Journal of Theoretical Physics, 22:941–954, 1982.

[31] G. J. Chaitin. Information, Randomness & Incompleteness: Papers on Algorithmic Information Theory. World Scientific, New Jersey, 1987.

[32] G. J. Chaitin. Randomness in arithmetic. Scientific American, 259:80–85, 1988.

[33] G. J. Chaitin. Randomness and complexity in pure mathematics. International Journal of Bifurcation and Chaos, 4:3–15, 1994.

[34] G. J. Chaitin. The Limits of Mathematics. Springer-Verlag, London, ISBN 1-85233-668-4, 2003.

[35] G. J. Chaitin. Meta Math! The Quest for Omega. Pantheon, ISBN 0375423133, 2005.

[36] G. J. Chaitin. The limits of reason. Scientific American, 294:74–81, 2006.

[37] T. M. Cover, P. Gács, and R. M. Gray. Kolmogorov's contributions to information theory and algorithmic complexity. Ann. Probab., 17:840–865, 1989.

[38] J. P. Crutchfield. The calculi of emergence: Computation, dynamics, and induction. Physica D, 75:11–54, 1994.

[39] J. P. Crutchfield and C. R. Shalizi. Thermodynamic depth of causal states: When paddling around in Occam's pool shallowness is a virtue. Working Paper 98-06-047, Santa Fe Institute, 1999.

[40] J. P. Crutchfield and K. Young. Inferring statistical complexity. Physical Review Letters, 63:105–108, 1989.

[41] D. Feldman. Computational Mechanics of Classical Spin Systems. Ph.D. thesis, University of California, Davis, 1998.

[42] S. Das. Quantum threat to our secret data. New Scientist, (2621), 13 September 2007.

[43] P. C. W. Davies. The Fifth Miracle: The Search for the Origin of Life. Penguin Books Ltd, 2003.

[44] D. Deutsch. Quantum theory, the Church-Turing principle and the universal quantum computer. Proceedings of the Royal Society, Series A, 400:97–117, 1985.


[45] S. D. Devine. The application of algorithmic information theory to noisy patterned strings. Complexity, 12(2):52–58, 2006.

[46] S. D. Devine. An algorithmic information theory approach to the emergence of order using simple replication models. In First International Conference on the Evolution and Development of the Universe, 2008.

[47] S. D. Devine. The insights of algorithmic entropy. Entropy, 11(1):85–110, 2009.

[48] S. D. Devine. Understanding how replication processes can maintain systems away from equilibrium using algorithmic information theory. Biosystems, 140:8–22, 2016.

[49] Digital physics. http://en.wikipedia.org/wiki/Digital_physics.

[50] M. Eigen. Selforganization of matter and the evolution of biological macromolecules. Naturwissenschaften, 58:465–523, 1971.

[51] M. Eigen, J. McCaskill, and P. Schuster. Molecular quasi-species. The Journal of Physical Chemistry, 92(24):6881–6891, 1988.

[52] S. Feferman. Gödel, Nagel, minds, and machines: Ernest Nagel lecture, Columbia University, Sept. 27, 2007. The Journal of Philosophy, 106:201–219, 2009.

[53] D. P. Feldman and J. P. Crutchfield. Discovering noncritical organization: Statistical mechanical, information theoretic, and computational views of patterns in one-dimensional spin systems. Technical Report SFI Working Paper 98-04-026, 1998.

[54] E. Fredkin. Digital mechanics. Physica D, pages 254–270, 1990.

[55] E. Fredkin and T. Toffoli. Conservative logic. International Journal of Theoretical Physics, 21:219–253, 1982.

[56] P. Gacs. On the symmetry of algorithmic information. Soviet Mathematics Doklady, 15:1477–1480, 1974.

[57] P. Gacs. Exact expressions for some randomness tests. Zeitschrift für mathematische Logik und Grundlagen der Mathematik, 26:385–394, 1980.

[58] P. Gacs. Lecture notes on descriptional complexity and randomness. Lecture notes, Boston University Computer Science Department, 1988. Up to date version at http://www.cs.bu.edu/~gacs/papers/ait-notes.pdf.

[59] P. Gacs. The Boltzmann entropy and randomness tests - extended abstract. In Proceedings of the Workshop on Physics and Computation, pages 209–216. IEEE Computer Society Press, 1994.


[60] P. Gacs. The Boltzmann entropy and random tests. Technical report, Boston University Computer Science Department, 2004. http://www.cs.bu.edu/~gacs/papers/ent-paper.pdf.

[61] G. F. Gause. The Struggle for Existence: Systems thinking and thinking systems. Williams and Wilkins, Baltimore, MD, 1934.

[62] P. D. Grunwald. Tutorial on minimum description length. In P. D. Grunwald, I. J. Myung, and M. A. Pitt, editors, Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.

[63] H. Price. Time's Arrow and Archimedes' Point: New Directions for the Physics of Time. Oxford University Press, New York, 1996.

[64] M. H. Hansen and B. Yu. Model selection and the principle of minimum description length. Journal of the American Statistical Association, 96:746–774, 2001.

[65] M. Hutter. A theory of universal artificial intelligence based on algorithmic complexity. http://arxiv.org/abs/cs.AI/0004001, 2000.

[66] M. Hutter. Universal Artificial Intelligence. Springer, ISBN 3-540-22139-5, London, 2005.

[67] M. Hutter. On generalized computable universal priors and their convergence. http://arxiv.org/PS_cache/cs/pdf/0503/0503026v1.pdf, 2005.

[68] M. Hutter. Universal algorithmic intelligence: A mathematical top-down approach. In B. Goertzel and C. A. Pennachin, editors, Artificial General Intelligence, pages 227–290. Springer, Berlin, 2007.

[69] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.

[70] E. T. Jaynes. Where do we stand on maximum entropy? In R. D. Levine and M. Tribus, editors, The Maximum Entropy Formalism, pages 15–118. MIT Press, Cambridge, Massachusetts, 1979.

[71] E. T. Jaynes and G. L. Bretthorst. Probability Theory: The Logic of Science. Cambridge University Press, New York, 2003.

[72] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Prob. Info. Trans., 1:1–7, 1965.

[73] E. Kvaalen. Has the Riemann hypothesis finally been proven? New Scientist, (2648):40–41, 2008.

[74] R. Landauer. Irreversibility and heat generation in the computing process. IBM Journal of Research and Development, 5:183–191, 1961.


[75] R. Landauer. Information is physical. In Proceedings of PhysComp 1992, pages 1–4. IEEE Computer Society Press, Los Alamitos, 1992.

[76] H. S. Leff and A. F. Rex. Maxwell's Demon: Entropy, Information, Computing. Princeton University Press, Princeton, 1990.

[77] L. A. Levin. On the notion of a random sequence. Soviet Mathematics Doklady, 14(5):1413–1416, 1973.

[78] L. A. Levin. Laws of information (nongrowth) and aspects of the foundation of probability theory. Problems of Information Transmission, 10(3):206–210, 1974.

[79] M. Li and P. M. B. Vitanyi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, New York, third edition, 2008.

[80] M. Li and P. M. B. Vitanyi. Reversibility and adiabatic computation: trading time and space for energy. Proceedings of the Royal Society of London, Series A, 452:769–789, 1996.

[81] S. Lloyd. Programming the Universe: A Quantum Computer Scientist Takes On the Cosmos. Knopf, ISBN 1-4000-4092-2, New York, 2006.

[82] S. Lloyd. Computational capacity of the universe. Phys. Rev. Lett., 88:237901, May 2002.

[83] P. Martin-Lof. The definition of random sequences. Information and Control, 9:602–619, 1966.

[84] M. Minsky. Size and structure of a universal Turing machine using tag systems. In Recursive Function Theory: Proceedings, Symposia in Pure Mathematics, volume 5, pages 229–238, Providence, 1962. AMS.

[85] Flying in V-formation gives best view for least effort. New Scientist, (April), 2007.

[86] J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, 20(3):198–203, 1976.

[87] J. Rissanen. Modeling by the shortest data description. Automatica, 14:465–471, 1978.

[88] J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11(2):416–431, 1983.

[89] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14:1080–1100, 1986.

[90] J. Rissanen. Stochastic complexity. J. Royal Statistical Society, 49B(3):223–265 and 252–265, 1987.


[91] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, New Jersey, 1989.

[92] E. D. Schneider and J. J. Kay. Life as a manifestation of the second law of thermodynamics. Mathl. Comput. Modelling, 16(6-8):25–48, 1994.

[93] C. P. Schnorr. Process complexity and effective random tests. Journal of Computer and System Sciences, 7:376–388, 1973.

[94] C. R. Shalizi and J. P. Crutchfield. Pattern discovery and computational mechanics. arxiv.org/abs/cs/0001027v1, 2000.

[95] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104:819–881, 2001.

[96] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J., 27:379–423, 1948.

[97] R. J. Solomonoff. A formal theory of inductive inference; part 1 and part 2. Information and Control, 7:1–22, 224–254, 1964.

[98] R. J. Solomonoff. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Trans. Information Theory, IT-24(4):422–432, 1978.

[99] R. J. Solomonoff. Lecture 1: Algorithmic probability. http://world.std.com/~rjs/iaplect1.pdf, 2005.

[100] R. J. Solomonoff. Lecture 2: Applications of algorithmic probability. http://world.std.com/~rjs/iaplect1.pdf, 2005.

[101] E. Szathmary and J. Maynard Smith. From replicators to reproducers: the first major transitions leading to life. Journal of Theoretical Biology, 187:555–571, 1997.

[102] L. Szilard. On the decrease of entropy in a thermodynamic system by the intervention of intelligent beings, 2003.

[103] L. Szilard. Über die Entropieverminderung in einem thermodynamischen System bei Eingriffen intelligenter Wesen. Zeitschrift für Physik, 53:840–856, 1929.

[104] T. Toffoli. Physics and computation. International Journal of Theoretical Physics, 21:165–175, 1982.

[105] R. C. Tolman. Statistical Mechanics. Oxford University Press, London, 1950.

[106] A. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 42, 1936.


[107] Wolfram’s 2,3 turing machine is universal, 2007.

[108] N. K. Vereshchagin and P. M. B. Vitanyi. Kolmogorovs structure func-tions and model selection. IEEE Transactions on Information Theory,50(12):3265–3290, 2004.

[109] P. M. B. Vitanyi and M. Li. Minimum description length induction,bayesianism, and kolmogorov complexity. IEEE Trans. Inform. Theory,46:446–464, 2000.

[110] P.M.B. Vitanyi. Time space and energy in reversible computing. InProceedings of the 2005 ACM International Conference on ComputingFrontiers, pages 435–444, Ischia, Italy, 2005.

[111] R. von Mises. Grundlagen der wahrscheinlichkeitsrechnung. Mathema-tische Zeitschrift, 5:52, 1919.

[112] S. Wolfram. A New Kind of Science. Wolfram Media, Champaign, IL,2002.

[113] W. H. Zurek. Algorithmic randomness and physical entropy. PhysicalReview A, 40(8):4731–4751, 1989.

[114] W. H. Zurek. Thermodynamics of of computation, algorithmic complex-ity and the information metric. Nature, 341:119–124, 1989.