
Ben Goertzel with Cassio Pennachin & Nil Geisweiller & the OpenCog Team

Engineering General Intelligence:

APPENDICES B-H

December 14, 2013


Contents

B Steps Toward a Formal Theory of Cognitive Structure and Dynamics
  B.1 Introduction
  B.2 Modeling Memory Types Using Category Theory
    B.2.1 The Category of Procedural Memory
    B.2.2 The Category of Declarative Memory
    B.2.3 The Category of Episodic Memory
    B.2.4 The Category of Intentional Memory
    B.2.5 The Category of Attentional Memory
  B.3 Modeling Memory Type Conversions Using Functors
    B.3.1 Converting Between Declarative and Procedural Knowledge
    B.3.2 Symbol Grounding: Converting Between Episodic and Declarative Knowledge
    B.3.3 Converting Between Episodic and Procedural Knowledge
    B.3.4 Converting Intentional or Attentional Knowledge into Declarative or Procedural Knowledge
    B.3.5 Converting Episodic Knowledge into Intentional or Attentional Knowledge
  B.4 Metrics on Memory Spaces
    B.4.1 Information Geometry on Memory Spaces
    B.4.2 Algorithmic Distance on Memory Spaces
  B.5 Three Hypotheses About the Geometry of Mind
    B.5.1 Hypothesis 1: Syntax-Semantics Correlation
    B.5.2 Hypothesis 2: Cognitive Geometrodynamics
    B.5.3 Hypothesis 3: Cognitive Synergy
  B.6 Next Steps in Refining These Ideas
  B.7 Returning to Our Basic Claims About CogPrime

C Emergent Reflexive Mental Structures
  C.1 Introduction
  C.2 Hypersets and Patterns
    C.2.1 Hypersets as Patterns in Physical or Computational Systems
  C.3 A Hyperset Model of Reflective Consciousness
  C.4 A Hyperset Model of Will
    C.4.1 In What Sense Is Will Free?
    C.4.2 Connecting Will and Consciousness
  C.5 A Hyperset Model of Self
  C.6 Validating Hyperset Models of Experience
  C.7 Implications for Practical Work on Machine Consciousness
    C.7.1 Attentional Focus in CogPrime
    C.7.2 Maps and Focused Attention in CogPrime
    C.7.3 Reflective Consciousness, Self and Will in CogPrime
    C.7.4 Encouraging the Recognition of Self-Referential Structures in the AtomSpace
  C.8 Algebras of the Social Self
  C.9 The Intrinsic Sociality of the Self
  C.10 Mirror Neurons and Associated Neural Systems
    C.10.1 Mirror Systems
  C.11 Quaternions and Octonions
  C.12 Modeling Mirrorhouses Using Quaternions and Octonions
  C.13 Specific Instances of Mental Mirrorhousing
  C.14 Mirroring in Development
  C.15 Concluding Remarks

D GOLEM: Toward an AGI Meta-Architecture Enabling Both Goal Preservation and Radical Self-Improvement
  D.1 Introduction
  D.2 The Goal Oriented Learning Meta-Architecture
    D.2.1 Optimizing the GoalEvaluator
    D.2.2 Conservative Meta-Architecture Preservation
    D.2.3 Complexity and Convergence Rate
  D.3 The Argument For GOLEM's Steadfastness
  D.4 A Partial Formalization of the Architecture and Steadfastness Argument
    D.4.1 Toward a Formalization of GOLEM
    D.4.2 Some Conjectures About GOLEM
  D.5 Comparison to a Reinforcement Learning Based Formulation
  D.6 Specifying the Letter and Spirit of Goal Systems (Are Both Difficult Tasks)
  D.7 A More Radically Self-Modifying GOLEM
  D.8 Concluding Remarks

E Lojban++: A Novel Linguistic Mechanism for Teaching AGI Systems
  E.1 Introduction
  E.2 Lojban versus Lojban++
  E.3 Some Simple Examples
  E.4 The Need for Lojban Software
  E.5 Lojban and Inference
    E.5.1 Lojban versus Predicate Logic
  E.6 Discussion
  E.7 Postscript: Basic Principles for Using English Words in Lojban++
  E.8 Syntax-based Argument Structure Conventions for English Words
  E.9 Semantics-based Argument Structure Conventions for English Words
  E.10 Lojban gismu of clear use within Lojban++
  E.11 Special Lojban++ cmavo
    E.11.1 qui
    E.11.2 it, quu
    E.11.3 quay

F PLN and the Brain
  F.1 How Might Probabilistic Logic Networks Emerge from Neural Structures and Dynamics?
  F.2 Avoiding Issues with Circular Inference
  F.3 Neural Representation of Recursion and Abstraction

G Possible Worlds Semantics and Experiential Semantics
  G.1 Introduction
  G.2 Inducing a Distribution over Predicates and Concepts
  G.3 Grounding Possible Worlds Semantics in Experiential Semantics
  G.4 Reinterpreting Indefinite Probabilities
    G.4.1 Reinterpreting Indefinite Quantifiers
  G.5 Specifying Complexity for Intensional Inference
  G.6 Reinterpreting Implication between Inheritance Relationships
  G.7 Conclusion

H Propositions About Environments in Which CogPrime Components Are Useful
  H.1 Propositions about MOSES
    H.1.1 Proposition: ENF Helps to Guide Syntax-Based Program Space Search
    H.1.2 Demes are Useful if Syntax/Semantics Correlations in Program Space Have a Small Scale
    H.1.3 Probabilistic Program Tree Modeling Helps in the Presence of Cross-Modular Dependencies
    H.1.4 Relating ENF to BOA
    H.1.5 Conclusion Regarding Speculative MOSES Theory
  H.2 Propositions About CogPrime
    H.2.1 When PLN Inference Beats BOA
    H.2.2 Conditions for the Usefulness of Hebbian Inference Control
    H.2.3 Clustering-together of Smooth Theorems
    H.2.4 When PLN is Useful Within MOSES
    H.2.5 When MOSES is Useful Within PLN
    H.2.6 On the Smoothness of Some Relevant Theorems
    H.2.7 Recursive Use of "MOSES with PLN" to Help With Attention Allocation
    H.2.8 The Value of Conceptual Blending
    H.2.9 A Justification of Map Formation
  H.3 Concluding Remarks


Appendix B
Steps Toward a Formal Theory of Cognitive Structure and Dynamics

B.1 Introduction

Transforming the conceptual and formal ideas of Section ?? into rigorous mathematical theory will be a large enterprise, and is not something we have achieved so far. However, we do believe we have some idea regarding what kind of mathematical and conceptual toolset will be useful for enacting this transformation. In this appendix we will elaborate our ideas regarding this toolset, and in the process present some concrete notions such as a novel mathematical formulation of the concept of cognitive synergy, and a more formal statement of many of the "key claims" regarding CogPrime given in Chapter ??.

The key ideas involved here are: modeling multiple memory types as mathematical categories (with functors mapping between them), modeling memory items as probability distributions, and measuring distance between memory items using two metrics, one based on algorithmic information theory and one on classical information geometry. Building on these ideas, core hypotheses are then presented:

• a syntax-semantics correlation principle, stating that in a successful AGI system, these two metrics should be roughly correlated

• a cognitive geometrodynamics principle, stating that on the whole intelligent minds tend to follow geodesics (shortest paths) in mindspace, according to various appropriately defined metrics (e.g. the metric measuring the distance between two entities in terms of the length and/or runtime of the shortest programs computing one from the other)

• a cognitive synergy principle, stating that shorter paths may be found through the composite mindspace formed by considering multiple memory types together, than by following the geodesics in the mindspaces corresponding to individual memory types.

These ideas are not strictly necessary for understanding the CogPrime design as outlined in Part 2 of this book. However, our hope is that they will be helpful later on for elaborating a deeper theoretical understanding of CogPrime, and hence in developing the technical aspects of the CogPrime design beyond the stage presented in Part 2. Our sense is that, ultimately, the theory and practice of AGI will both go most smoothly if they can proceed together, with theory guiding algorithm and architecture tuning, but also inspired by lessons learned via practical experimentation. At present the CogPrime design has been inspired by a combination of broad theoretical notions about the overall architecture, and specific theoretical calculations regarding specific components. One of our hopes is that in later versions of CogPrime, precise theoretical calculations regarding the overall architecture may also be possible, perhaps using ideas descending from those in this appendix.

B.2 Modeling Memory Types Using Category Theory

We begin by formalizing the different types of memory critical for a human-like integrative AGI system, in a manner that makes it easy to study mappings between different memory types. One way to do this is to consider each type of memory as a category, in the sense of category theory. Specifically, in this section we roughly indicate how one may model declarative, procedural, episodic, attentional and intentional categories, thus providing a framework in which mapping between these different memory types can be modeled using functors. The discussion is quite brief and general, avoiding commitments about how memories are implemented.

B.2.1 The Category of Procedural Memory

We model the space of procedures as a graph. We assume there exists a set $T$ of "atomic transformations" on the category $C_{Proc}$ of procedures, so that each $t \in T$ maps an input procedure into a unique output procedure. We then consider a labeled digraph whose nodes are objects in $C_{Proc}$ (i.e. procedures), and which has a link labeled $t$ between procedures $P_1$ and $P_2$ if $t$ maps $P_1$ into $P_2$. Morphisms on program space may then be taken as paths in this digraph, i.e. as composite procedure transformations defined by sequences of atomic procedure transformations.

As an example, if procedures are represented as ensembles of program trees, where program trees are defined in the manner suggested in [?] and ??, then one can consider tree edit operations as defined in [?] as one's atomic transformations. If procedures are represented as formal neural nets or ensembles thereof, one can take a similar approach.
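To make the construction concrete, here is a minimal Python sketch of such a labeled digraph, under our own illustrative assumptions (the Procedure class, the particular atomic transformations, and all names are ours, not prescriptions of the CogPrime design):

```python
# Sketch: the category of procedures as a labeled digraph. Nodes are toy
# program trees; edges are labeled by atomic transformations t in T.

class Procedure:
    """A toy program tree: an operator plus child subtrees."""
    def __init__(self, op, children=()):
        self.op, self.children = op, tuple(children)
    def __repr__(self):
        args = ", ".join(map(repr, self.children))
        return f"{self.op}({args})" if self.children else self.op

def relabel_root(new_op):          # an atomic transformation t: C_Proc -> C_Proc
    return lambda p: Procedure(new_op, p.children)

def wrap(outer_op):                # another atomic transformation
    return lambda p: Procedure(outer_op, [p])

def edges_from(procedures, transformations):
    """Links of the digraph: (P1, label t, P2) whenever t maps P1 into P2."""
    return [(p, name, t(p)) for p in procedures
            for name, t in transformations.items()]

T = {"relabel:+": relabel_root("+"), "wrap:neg": wrap("neg")}
p1 = Procedure("*", [Procedure("x"), Procedure("y")])
for src, label, dst in edges_from([p1], T):
    print(f"{src} --[{label}]--> {dst}")
```

Morphisms are then paths in this digraph; composing edge labels composes the corresponding transformations.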

B.2.2 The Category of Declarative Memory

The category $C_{Dec}$ of declarative knowledge may be handled somewhat similarly, via assuming the existence of a set of transformations between declarative knowledge items, constructing a labeled digraph induced by these transformations, and defining morphisms as paths in this digraph. For example, if declarative knowledge items are represented as expressions in some logical language, then transformations may be naturally taken to correspond to inference steps in the associated logic system. Morphisms then represent sequences of inference steps that transform one logical expression into another.


B.2.3 The Category of Episodic Memory

What about episodic memory – the record an intelligence keeps of its own experiences? Given that we are talking about intelligences living in a world characterized by three spatial dimensions and one temporal dimension, one way to model a remembered episode (i.e., an object in the category $C_{Ep}$ of episodic memories) is as a scalar field defined over a grid-cell discretization of 4D spacetime. The scalar field, integrated over some region of spacetime, tells the extent to which that region belongs to the episode. In this way one may also consider episodes as fuzzy sets of spacetime regions. We may then consider a category whose objects are episode-sets, i.e. fuzzy sets of fuzzy sets of spacetime regions.

To define morphisms on the space of episode-sets, one approach is to associate an episode $E$ with the set $P_{E,\epsilon}$ of programs that calculate the episode within a given error $\epsilon$. One may then construct a graph whose nodes are episode-sets, and in which $E_1$ is linked to $E_2$ if applying an atomic procedure-transformation to some program in $P_{E_1,\epsilon}$ yields a program in $P_{E_2,\epsilon}$.
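To illustrate the scalar-field view, here is a small sketch in Python; the grid shape, the membership rule, and all names are our assumptions for illustration:

```python
# Sketch: an episode as a scalar field over a grid-cell discretization of
# 4D spacetime; integrating the field over a region gives the fuzzy degree
# to which the region belongs to the episode.
import numpy as np

class Episode:
    def __init__(self, shape=(8, 8, 8, 16)):     # (x, y, z, t) grid cells
        self.field = np.zeros(shape)

    def membership(self, region_mask):
        """Fuzzy degree to which a spacetime region belongs to the episode,
        here computed as the normalized integral of the field over it."""
        return self.field[region_mask].sum() / (self.field.sum() + 1e-12)

ep = Episode()
ep.field[2:5, 2:5, 2:5, 0:8] = 1.0               # where the episode "happens"
region = np.zeros(ep.field.shape, dtype=bool)
region[3, 3, 3, :] = True                        # a thin region to query
print(round(ep.membership(region), 4))
```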

B.2.4 The Category of Intentional Memory

To handle the category $C_{Int}$ of intentional knowledge, we recall that in our formal agents model, goals are functions. Therefore, to specify the category of goals a logic of functions may be used (e.g. as in [?]), with transformations corresponding to logical inference steps in the logic of functions.

B.2.5 The Category of Attentional Memory

Finally, the category $C_{Att}$ of attentional knowledge is handled somewhat similarly to goals. Attentional evaluations may be modeled as maps from elements of $C_{Int} \cup C_{Dec} \cup C_{Ep} \cup C_{Proc}$ into a space $V$ of AttentionValues. As such, attentional evaluations are functions, and may be considered as a category in a manner similar to the goal functions.

B.3 Modeling Memory Type Conversions Using Functors

Having modeled memory types as categories, we may now model conversions between memory types as mappings between categories. This is one step on the path to formalizing the notion of cognitive synergy within the formal cognitive architecture presented in the previous section.

B.3.1 Converting Between Declarative and Procedural Knowledge

To understand conversion back and forth between declarative and procedural knowledge, consider the cases:


• the category blue versus the procedure isBlue that outputs a number in $[0, 1]$ indicating the degree of blueness of its input

• the statement "the sky is blue" versus the procedure that outputs a number in $[0, 1]$ indicating the degree to which its input is semantically similar to the statement "the sky is blue"

• a procedure for serving a tennis ball on the singles boundary at the edge of the service box, as close as possible to the net; versus a detailed description of this procedure, of the sort that could be communicated verbally (though it might take a long time)

• a procedure for multiplying numbers, versus a verbal description of that procedure

• a logical description of the proof of a theorem based on some axioms; versus a procedure that produces the theorem given the axioms as inputs

From these examples we can see that procedural and declarative knowledge are in a sense interchangeable; and yet, some entities seem more naturally represented procedurally, whereas others seem more naturally represented declaratively. Relatedly, it seems that some knowledge is more easily obtained via learning algorithms that operate on procedural representations; and other knowledge is more easily obtained via learning algorithms that operate on declarative representations.

Formally, we may define a "procedure declaratization" as a functor from $C_{Proc}$ to $C_{Dec}$; in other words, a pair of mappings $(r, s)$ so that

• $r$ maps each object in $C_{Proc}$ into some object in $C_{Dec}$

• $s$ maps each morphism $f_{Proc,i}$ in $C_{Proc}$ into some morphism in $C_{Dec}$, in a way that obeys $s(f_{Proc,i} \circ f_{Proc,j}) = s(f_{Proc,i}) \circ s(f_{Proc,j})$

Similarly, we may define a "declaration procedurization" as a functor from $C_{Dec}$ to $C_{Proc}$.
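Computationally, the functor laws amount to testable equalities; the following sketch (with hypothetical names, morphisms modeled as plain functions, and the law checked only pointwise on samples, since the true equality ranges over whole categories) shows what $s$ must satisfy:

```python
# Sketch: testing the functor law s(f ∘ g) = s(f) ∘ s(g) on samples.
# r maps objects of C_Proc to objects of C_Dec; s maps morphisms likewise.
def compose(f, g):
    return lambda x: f(g(x))

def respects_composition(r, s, morphism_pairs, sample_objects):
    for f, g in morphism_pairs:
        lhs = s(compose(f, g))          # map the composite morphism
        rhs = compose(s(f), s(g))       # map each factor, then compose
        if any(lhs(r(x)) != rhs(r(x)) for x in sample_objects):
            return False
    return True
```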

B.3.2 Symbol Grounding: Converting Between Episodic and Declarative Knowledge

Next we consider converting back and forth between episodic and declarative knowledge. Particular cases of this conversion have received significant attention in the cognitive science literature, referred to by the term "symbol grounding."

It is relatively straightforward to define "episode declaratization" and "declaration episodization" functors formally, in the manner of the above definitions regarding declarative/procedural conversion. Conceptually,

• Episode declaratization produces a declaration describing an episode-set (naturally this declaration may be a conjunction of many simple declarations)

• Declaration episodization produces an episode-set defined as the set of episodes whose descriptions include a certain declaration

As a very simple example of declaration episodization: the predicate $isCat(x)$ could be mapped into the fuzzy set $E$ of episodes containing cats, where the degree of membership of $e$ in $E$ could be measured as the degree to which $e$ contains a cat. In this case, the episode-set would commonly be called the "grounding" of the predicate. Similarly, a relationship such as a certain sense of the preposition "with" could be mapped into the set of episodes containing relationships between physical entities that embody this word-sense.
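A minimal sketch of such a grounding as a fuzzy episode-set (the episode records and the visibility score are invented for illustration):

```python
# Sketch: declaration episodization as a fuzzy set of episodes.
# The membership signal is hypothetical; any [0,1]-valued detector works.
episodes = {
    "ep1": {"cat_visibility": 0.9},   # a cat is clearly present
    "ep2": {"cat_visibility": 0.2},   # maybe a cat in the background
    "ep3": {"cat_visibility": 0.0},   # no cat
}

def ground_isCat(episodes):
    """Map the predicate isCat into a fuzzy episode-set (its grounding)."""
    return {name: ep["cat_visibility"] for name, ep in episodes.items()}

print(ground_isCat(episodes))   # {'ep1': 0.9, 'ep2': 0.2, 'ep3': 0.0}
```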


As a very simple example of episode declaratization: an episode that an agent experienced while playing fetch with someone could be mapped into a description of the episode including information about the kind of ball being used in the "fetch" game, the name and other properties of the other person participating in the "fetch" game, the length of time the game lasted, etc.

B.3.2.1 Algorithmically Performing Episodic/Declarative Conversion

One way that these processes could occur in an intelligent system would be for episode declaratization to guide both processes. That is, the system would need some capability to abstract declarative knowledge from observed or remembered episodes. Then, given a description, the system could carry out declaration episodization via solving the "inverse problem" of episode declaratization, i.e. given a declarative object $D$ (a code sketch of this search loop appears after the steps below):

1. First it could search episodic memory for episodes $E_i$ whose (stored or on-the-fly-computed) descriptions fully or approximately match $D$

2. If any of the $E_i$ is extremely accurately describable by $D$, then it is returned as the answer

3. Otherwise, if some of the $E_i$ are moderately but not extremely accurately describable by $D$, they are used as initial guesses for local search aimed at finding some episode $E$ whose description closely matches $D$

4. If no sufficiently promising $E_i$ can be found, then a more complex cognitive process is carried out; for instance, in a CogPrime system,

• Inference may be carried out to find $E_i$ that lead to descriptions $D_i$ that are inferentially found to be closely equivalent to $D$ (in spite of this near-equivalence not being obvious without inference)

• Evolutionary learning may be carried out to "evolve" episodes, with the fitness function defined in terms of describability by $D$
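Here is the promised sketch of that control loop; the thresholds, the describability function, and the local-search and fallback stubs are all our assumptions, not specified by the text:

```python
# Sketch: declaration episodization as the inverse problem of episode
# declaratization. `describability(D, e)` in [0,1] is assumed given.
def episodize(D, episodic_memory, describability,
              accept=0.95, seed_threshold=0.5, local_search=None):
    # Steps 1-2: search memory for episodes accurately described by D.
    scored = sorted(episodic_memory, key=lambda e: describability(D, e),
                    reverse=True)
    if scored and describability(D, scored[0]) >= accept:
        return scored[0]
    # Step 3: moderately matching episodes seed a local search.
    seeds = [e for e in scored if describability(D, e) >= seed_threshold]
    if seeds and local_search is not None:
        return local_search(seeds, fitness=lambda e: describability(D, e))
    # Step 4: fall back to heavier cognition (inference; evolving episodes
    # with fitness = describability by D) -- not shown in this sketch.
    raise NotImplementedError("escalate to inference / evolutionary learning")
```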

B.3.2.2 Development of Better Symbol Groundings as Natural Transformation

As an application of the modeling of memory types as categories, it's interesting to think about the interpretation of functor categories and natural transformations in the context of memory types, and in particular in the context of "symbol groundings" of declarative knowledge in episodic knowledge.

First of all, the functor category $(C_{Ep})^{C_{Dec}}$

• has as objects all functors from $C_{Dec}$ to $C_{Ep}$ (e.g. all methods of assigning experiences to the sets of declarations satisfying them, which nicely map transformation paths into transformation paths)

• has as morphisms the natural transformations between these functors.

That is, suppose $F$ and $G$ are functors between $C_{Dec}$ and $C_{Ep}$; that is, $F$ and $G$ are two different ways of grounding declarative knowledge in episodic knowledge. Then, a natural transformation $\eta$ from $F$ to $G$ associates to every object $X$ in $C_{Dec}$ (i.e., to every declaration $X$) a morphism $\eta_X : F(X) \to G(X)$ in $C_{Ep}$ (that is, $\eta_X$ is a composite transformation mapping the episode-set $F(X)$ into the episode-set $G(X)$) so that: for every morphism $f : X \to Y$ in $C_{Dec}$ we have $\eta_Y \circ F(f) = G(f) \circ \eta_X$.


An easier way to conceptualize this may be to note that in the commutative diagram

$$\begin{array}{ccc}
F(X) & \xrightarrow{\;F(f)\;} & F(Y) \\
\downarrow{\scriptstyle \eta_X} & & \downarrow{\scriptstyle \eta_Y} \\
G(X) & \xrightarrow{\;G(f)\;} & G(Y)
\end{array}$$

we have a situation where

• $X$ and $Y$ represent declarations

• $f$ represents a sequence of atomic transformations between declarations

• all corners of the diagram correspond to episode-sets

• all arrows correspond to sequences of atomic transformations between episode-sets

• $\eta_X$ and $\eta_Y$ represent sequences of atomic transformations between episode-sets

In other words, a natural transformation between two methods of grounding is: a mapping that assigns to each declaration a morphism on episodic memory that preserves the commutative diagram with respect to the two methods of grounding.

Cognitively, what this suggests is that developing better and better groundings is a matter of starting with one grounding and then naturally transforming it into better and better groundings.

To make things a little clearer, we now present the above commutative diagram using a more transparent, application-specific notation. Let us consider a specific example wherein:

• $X$ is represented by the predicate isTiger, and $Y$ is represented by the predicate isCat

• $f$ is represented by an example inference trail (i.e. transformation process) leading from isTiger to isCat, which we will denote $isa(tiger, cat)$

• $F$ and $G$ are relabeled $Grounding_1$ and $Grounding_2$ (as these are two functors that ground declarative knowledge in episodic knowledge)

• $F(X)$ is relabeled $TigerEpisodes_1$ (as it's the set of episodes associated with isTiger under the grounding $Grounding_1$); similarly, $F(Y)$ is relabeled $CatEpisodes_1$, $G(X)$ is relabeled $TigerEpisodes_2$, and $G(Y)$ is relabeled $CatEpisodes_2$

• $F(f)$ is relabeled $Grounding_1(isa(tiger, cat))$; and $G(f)$ is relabeled $Grounding_2(isa(tiger, cat))$

• $\eta_X$ and $\eta_Y$ become $\eta_{isTiger}$ and $\eta_{isCat}$ respectively

With these relabelings, the above commutative diagram looks like

$$\begin{array}{ccc}
TigerEpisodes_1 & \xrightarrow{\;Grounding_1(isa(tiger,cat))\;} & CatEpisodes_1 \\
\downarrow{\scriptstyle \eta_{isTiger}} & & \downarrow{\scriptstyle \eta_{isCat}} \\
TigerEpisodes_2 & \xrightarrow{\;Grounding_2(isa(tiger,cat))\;} & CatEpisodes_2
\end{array}$$

One may draw similar diagrams involving the other pairs of memory types, with similar interpretations.
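In code, naturality for this concrete diagram is one equality per sampled episode-set; the toy groundings and transformations below are invented purely to exercise the condition $\eta_{isCat} \circ Grounding_1(f) = Grounding_2(f) \circ \eta_{isTiger}$:

```python
# Sketch: testing naturality of η on the isTiger --isa(tiger,cat)--> isCat arrow.
# G1_f, G2_f are the two groundings' images of the inference step; eta maps
# each predicate to a transformation between the two groundings.
def commutes(eta, G1_f, G2_f, source_pred, target_pred, sample_episode_sets):
    for s in sample_episode_sets:
        left = eta[target_pred](G1_f(s))    # across the top, then down
        right = G2_f(eta[source_pred](s))   # down, then across the bottom
        if left != right:
            return False
    return True

# Toy episode-sets as sets of episode ids; toy transformations:
G1_f = lambda s: s | {"cat_ep_1"}          # Grounding_1(isa(tiger, cat))
G2_f = lambda s: s | {"cat_ep_2"}          # Grounding_2(isa(tiger, cat))
eta = {"isTiger": lambda s: {e.replace("_1", "_2") for e in s},
       "isCat":   lambda s: {e.replace("_1", "_2") for e in s}}
print(commutes(eta, G1_f, G2_f, "isTiger", "isCat", [{"tiger_ep_1"}]))  # True
```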


B.3.3 Converting Between Episodic and Procedural Knowledge

Mapping between episodic and procedural knowledge may be done indirectly via the mappings already described above. Of course, such mappings could also be constructed directly, but for our present purposes the indirect approach will suffice.

Episode procedurization maps an episode-set into the set of procedures whose execution is part of the description of the episode-set. A simple example of episode procedurization would be: mapping a set of episodes involving playing "fetch" into procedures for coordinating the fetch game, throwing an object, catching an object, walking, and so forth.

Procedure episodization maps a procedure into the set of episodes appearing to contain executions of the procedure. For instance, a procedure for playing fetch would map into a set of episodes involving playing fetch; or, a procedure for adding numbers would map into a set of episodes involving addition, which might include a variety of things such as:

• "textbook examples" such as: a set of two apples, and a set of three apples, merging to form a set of five apples

• a financial transaction at a cash register in a store, involving the purchase of several items and the summing of their prices into a composite price

B.3.4 Converting Intentional or Attentional Knowledge into Declarative or Procedural Knowledge

Attentional valuations and goals are considered as functions; thus, though they may be represented in various "native" forms, their conversion into procedural knowledge is conceptually straightforward.

Conversion to declarative knowledge may occur by way of procedural knowledge, or may be more easily considered directly in some cases. For instance, the assignment of attention values to declarative knowledge items is easily represented as declarative knowledge, i.e. using statements of the form "Knowledge item $K_1$ has attention value $V_1$."

B.3.5 Converting Episodic Knowledge into Intentional or Attentional Knowledge

Episodes may contain implicit information about which entities should be attended in which contexts, and which goals have which subgoals in which contexts. Mining this information is not a simple process and requires application of significant intelligence.

B.4 Metrics on Memory Spaces

Bringing together the ideas from the previous sections, we now explain how to use the above ideas to define geometric structures for cognitive space, via defining two metrics on the space of memory store dynamic states. Specifically, we define the dynamic state or d-state of a memory store (e.g. attentional, procedural, etc.) as the series of states of that memory store (as a whole) during a time-interval. Generally speaking, it is necessary to look at d-states rather than instantaneous memory states because sometimes memory systems may store information using dynamical patterns rather than fixed structures.

It's worth noting that, according to the metrics introduced here, the above-described mappings between memory types are topologically continuous, but involve considerable geometric distortion – so that e.g., two procedures that are nearby in the procedure-based mindspace may be distant in the declarative-based mindspace. This observation will lead us to the notion of cognitive synergy, below.

B.4.1 Information Geometry on Memory Spaces

Our first approach involves viewing memory store d-states as probability distributions. A d-state spanning time interval $(p, q)$ may be viewed as a mapping whose input is the state of the world and the other memory stores during a given interval of time $(r, s)$, and whose output is the state of the memory itself during interval $(t, u)$. Various relations between these endpoints may be utilized, achieving different definitions of the mapping, e.g. $p = r = t$, $q = s = u$ (in which case the d-state and its input and output are contemporaneous) or else $p = r$, $q = s = t$ (in which case the output occurs after the simultaneous d-state and input), etc. In many cases this mapping will be stochastic. If one assumes that the input is an approximation of the state of the world and the other memory stores, then the mapping will nearly always be stochastic. So in this way, we may model the total contents of a given memory store at a certain point in time as a probability distribution. And the process of learning is then modeled as one of coupled changes in multiple memory stores, in such a way as to enable ongoingly improved achievement of system goals.

Having modeled memory store states as probability distributions, the problem of measuring distance between memory store states is reduced to the problem of measuring distance between probability distributions. But this problem has a well-known solution: the Fisher-Rao metric!

Fisher information is a statistical quantity which has a variety of applications, ranging beyond statistical data analysis, including physics [?], psychology and AI [?]. Put simply, FI is a formal way of measuring the amount of information that an observable random variable $X$ carries about an unknown parameter $\theta$ upon which the probability of $X$ depends. FI forms the basis of the Fisher-Rao metric, which has been proved the only Riemannian metric on the space of probability distributions satisfying certain natural properties regarding invariance with respect to coordinate transformations. Typically $\theta$ in the FI is considered to be a real multidimensional vector; however, [?] has presented a FI variant that imposes basically no restrictions on the form of $\theta$, which is what we need here.

Suppose we have a random variable $X$ with a probability function $f(X, \theta)$ that depends on a parameter $\theta$ that lives in some space $M$ that is not necessarily a dimensional space. Let $E \subseteq \mathbb{R}$ have a limit point at $t \in \mathbb{R}$, and let $\gamma : E \to M$ be a path. We may then consider a function $G(t) = \ln f(X, \gamma(t))$; and, letting $\gamma(0) = \theta$, we may then define the generalized Fisher information as

$$I(\theta)_\gamma = I_X(\theta)_\gamma = E\left[\left(\frac{\partial}{\partial t} \ln f(X; \gamma(t))\right)^{2} \,\middle|\, \theta\right].$$


Next, Dabak [?] has shown that the geodesic between $\theta$ and $\theta'$ is given by the exponentially weighted curve

$$(\gamma(t))(x) = \frac{f(x, \theta)^{1-t}\, f(x, \theta')^{t}}{\int f(y, \theta)^{1-t}\, f(y, \theta')^{t}\, dy},$$

under the weak condition that the log-likelihood ratios with respect to $f(X, \theta)$ and $f(X, \theta')$ are finite. It follows that if we use this form of curve, then the generalized Fisher information reduces properly to the Fisher information in the case of dimensional spaces. Also, along this sort of curve, the sum of the Kullback-Leibler distances between $\theta$ and $\theta'$, known as the J-divergence, equals the integral of the Fisher information along the geodesic connecting $\theta$ and $\theta'$.
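For discrete distributions this geodesic is straightforward to compute; here is a small numpy sketch of the exponentially weighted curve and of the J-divergence as the sum of the two KL distances (the discretization and sample points are our choices):

```python
# Sketch: the exponentially weighted geodesic between two discrete
# distributions, gamma(t) ∝ p^(1-t) * q^t, and J(p, q) = KL(p||q) + KL(q||p).
import numpy as np

def geodesic_point(p, q, t):
    w = p ** (1 - t) * q ** t
    return w / w.sum()                  # normalization plays the role of the integral

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])
path = [geodesic_point(p, q, t) for t in np.linspace(0, 1, 5)]
print(np.round(path[2], 3))             # midpoint of the geodesic
print(round(kl(p, q) + kl(q, p), 4))    # J-divergence between the endpoints
```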

Finally, another useful step for our purposes is to bring Fisher information together with imprecise and indefinite probabilities as discussed in [?]. For instance, an indefinite probability takes the form $((L, U), k, b)$ and represents an envelope of probability distributions, whose means after $k$ more observations lie in $(L, U)$ with probability $b$. The Fisher-Rao metric between probability distributions is naturally extended to yield a metric between indefinite probability distributions.

B.4.2 Algorithmic Distance on Memory Spaces

A conceptually quite different way to measure the distance between two d-states, on the other hand, is using algorithmic information theory. Assuming a fixed Universal Turing Machine $M$, one may define $H(S_1, S_2)$ as the length of the shortest self-delimiting program which, given as input d-state $S_1$, produces as output d-state $S_2$. A metric is then obtained via setting $d(S_1, S_2) = (H(S_1, S_2) + H(S_2, S_1))/2$. This tells you the computational cost of transforming $S_1$ into $S_2$ and vice versa.
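Since $H(S_1, S_2)$ is uncomputable, any implementation needs a proxy; one standard computable stand-in (our suggestion here, not something the text prescribes) is the normalized compression distance:

```python
# Sketch: approximating algorithmic distance between serialized d-states
# with a compressor (normalized compression distance, NCD).
import zlib

def C(b: bytes) -> int:
    return len(zlib.compress(b, 9))

def ncd(s1: bytes, s2: bytes) -> float:
    c1, c2, c12 = C(s1), C(s2), C(s1 + s2)
    return (c12 - min(c1, c2)) / max(c1, c2)

state_a = b"procedure: fetch ball; attend: ball, owner" * 20
state_b = b"procedure: fetch stick; attend: stick, owner" * 20
print(round(ncd(state_a, state_b), 3))   # small value = algorithmically close
```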

There are variations of this which may also be relevant; for instance [?] defines the generalized complexity criterion

$$K_\Phi(x) = \min_{i \in \mathbb{N}} \{\, \Phi(i, \tau_i) \mid L(p_i) = x \,\},$$

where $L$ is a programming language, $p_i$ is the $i$'th program executable by $L$ under an enumeration in order of nondecreasing program length, $\tau_i$ is the execution time of the program $p_i$, $L(p_i)$ is the result of $L$ executing $p_i$, and $\Phi$ is a function mapping pairs of integers into positive reals, representing the trade-off between program length and runtime. Via modulating $\Phi$, one may cause this complexity criterion to weight only program length (like standard algorithmic information theory), only runtime (like the speed prior), or to balance the two against each other in various ways.

Suppose one uses the generalized complexity criterion, but looking only at programs $p_i$ that are given $S_1$ as input. Then $K_\Phi(S_2)$, relative to this list of programs, yields an asymmetric distance $H_\Phi(S_1, S_2)$, which may be symmetrized as above to yield $d_\Phi(S_1, S_2)$. This gives a more flexible measure of how hard it is to get to one of $(S_1, S_2)$ from the other one, in terms of both memory and processing time.

One may discuss geodesics in this sort of algorithmic metric space, just as in Fisher-Rao space. A geodesic in algorithmic metric space has the property that, between any two points on the path, the integral of the algorithmic complexity incurred while following the path is less than or equal to that which would be incurred by following any other path between those two points. The algorithmic metric is not equivalent to the Fisher-Rao metric, a fact that is consistent with Cencov's Theorem because the algorithmic metric is not Riemannian (i.e. it is not locally approximated by a metric defined via any inner product).


B.5 Three Hypotheses About the Geometry of Mind

Now we present three hypotheses regarding generally intelligent systems, using the conceptual and mathematical machinery we have built.

B.5.1 Hypothesis 1: Syntax-Semantics Correlation

The informational and algorithmic metrics, as defined above, are not equivalent nor necessarily closely related; however, we hypothesize that on the whole, systems will operate more intelligently if the two metrics are well correlated, implying that geodesics in one space should generally be relatively short paths (even if not geodesics) in the other.

This hypothesis is a more general version of the "syntax-semantics correlation" property studied in [?] in the context of automated program learning. There, it is shown empirically that program learning is more effective when programs with similar syntax also have similar behaviors. Here, we are suggesting that an intelligent system will be more effective if memory stores with similar structure and contents lead to similar effects (both externally to the agent, and on other memory systems). Hopefully the basic reason for this is clear. If syntax-semantics correlation holds, then learning based on the internal properties of the memory store can help figure out things about the external effects of the memory store. On the other hand, if it doesn't hold, then it becomes quite difficult to figure out how to adjust the internals of the memory to achieve desired effects.

The assumption of syntax-semantics correlation has huge implications for the design of learning algorithms associated with memory stores. All of CogPrime's learning algorithms are built on this assumption. For example, CogPrime's MOSES procedure learning component [?] assumes syntax-semantics correlation for individual programs, from which it follows that the property holds also on the level of the whole procedural memory store. And CogPrime's PLN probabilistic inference component [?] uses an inference control mechanism that seeks to guide a new inference via analogy to prior similar inferences, thus embodying an assumption that structurally similar inferences will lead to similar behaviors (conclusions).

B.5.2 Hypothesis 2: Cognitive Geometrodynamics

In general relativity theory there is the notion of "geometrodynamics," referring to the feedback by which matter curves space, and then space determines the movement of matter (via the rule that matter moves along geodesics in curved spacetime) [?]. One may wonder whether an analogous feedback exists in cognitive geometry. We hypothesize that the answer is yes, to a limited extent. On the one hand, according to the above formalism, the curvature of mindspace is induced by the knowledge in the mind. On the other hand, one may view cognitive activity as approximately following geodesics in mindspace.

Let's say an intelligent system has the goal of producing knowledge meeting certain characteristics (and note that the desired achievement of a practical system objective may be framed in this way, as seeking the true knowledge that the objective has been achieved). The goal then corresponds to some set of d-states for some of the mind's memory stores. A simplified but meaningful view of cognitive dynamics is, then, that the system seeks the shortest path from the current d-state to the region in d-state space comprising goal d-states. For instance, considering the algorithmic metric, this reduces to the statement that at each time point, the system seeks to move itself along a path toward its goal, in a manner that requires the minimum computational cost – i.e. along some algorithmic geodesic. And if there is syntax-semantics correlation, then this movement is also approximately along a Fisher-Rao geodesic.

And as the system progresses from its current state toward its goal-state, it is creating new memories – which then curve mindspace, possibly changing it substantially from the shape it had before the system started moving toward its goal. This is a feedback conceptually analogous to, though in detail very different from, general-relativistic geometrodynamics.

There is some subtlety here related to fuzziness. A system's goals may be achievable to various degrees, so that the goal region may be better modeled as a fuzzy set of lists of regions. Also, the system's current state may be better viewed as a fuzzy set than as a crisp set. This is the case with CogPrime, where uncertain knowledge is labeled with confidence values along with probabilities; in this case the confidence of a logical statement may be viewed as the fuzzy degree with which it belongs to the system's current state. But this doesn't change the overall cognitive-geometrodynamic picture, it just adds a new criterion; one may say that the cognition seeks a geodesic from a high-degree portion of the current-state region to a high-degree portion of the goal region.

B.5.3 Hypothesis 3: Cognitive Synergy

Cognitive synergy, discussed extensively above, is a conceptual explanation of what makes it possible for certain sorts of integrative, multi-component cognitive systems to achieve powerful general intelligence [?]. The notion pertains to systems that possess knowledge creation (i.e. pattern recognition / formation / learning) mechanisms corresponding to each of multiple memory types. For such a system to display cognitive synergy, each of these cognitive processes must have the capability to recognize when it lacks the information to perform effectively on its own; and in this case, to dynamically and interactively draw information from knowledge creation mechanisms dealing with other types of knowledge. Further, this cross-mechanism interaction must have the result of enabling the knowledge creation mechanisms to perform much more effectively in combination than they would if operated non-interactively.

How does cognitive synergy manifest itself in the geometric perspective we've sketched here? Perhaps the most straightforward way to explore it is to construct a composite metric, merging together the individual metrics associated with specific memory spaces.

In general, given $N$ metrics $d_k(x, z)$, $k = 1 \ldots N$, defined on the same finite space $M$, we can define the "min-combination" metric

$$d_{d_1,\ldots,d_N}(x, z) = \min_{\substack{y_0 = x,\; y_{n+1} = z,\; y_i \in M,\\ r(i) \in \{1,\ldots,N\},\; i \in \{0,\ldots,n\},\; n \in \mathbb{Z}_{\geq 0}}} \;\sum_{i=0}^{n} d_{r(i)}(y_i, y_{i+1})$$

This metric is conceptually similar to (and mathematically generalizes) min-cost metrics like the Levenshtein distance used to compare strings [?]. To see that it obeys the metric axioms is straightforward; the triangle inequality follows similarly to the case of the Levenshtein metric. In the case where $M$ is infinite, one replaces min with inf (the infimum) and things proceed similarly. The min-combination distance from $x$ to $z$ tells you the length of the shortest path from $x$ to $z$, using the understanding that for each portion of the path, one can choose any one of the metrics being combined. Here we are concerned with cases such as $d_{syn} = d_{d_{Proc}, d_{Dec}, d_{Ep}, d_{Att}}$.
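On a finite space the min-combination metric is simply a shortest-path computation over a multigraph whose parallel edges come from the component metrics; here is a small Dijkstra-style sketch (names and the toy metrics are ours):

```python
# Sketch: computing the min-combination metric d_{d_1,...,d_N} on a finite
# space M, where each hop of the path may use whichever metric is cheapest.
import heapq

def min_combination(M, metrics, x, z):
    dist = {y: float("inf") for y in M}
    dist[x] = 0.0
    heap = [(0.0, x)]
    while heap:
        d, y = heapq.heappop(heap)
        if y == z:
            return d
        if d > dist[y]:
            continue
        for y2 in M:
            if y2 == y:
                continue
            step = min(m(y, y2) for m in metrics)   # best single-metric hop
            if d + step < dist[y2]:
                dist[y2] = d + step
                heapq.heappush(heap, (d + step, y2))
    return dist[z]

# Two toy metrics, each "fast" along a different axis of a 3x3 grid:
M = [(i, j) for i in range(3) for j in range(3)]
d1 = lambda a, b: abs(a[0] - b[0]) + 10 * abs(a[1] - b[1])
d2 = lambda a, b: 10 * abs(a[0] - b[0]) + abs(a[1] - b[1])
print(min_combination(M, [d1, d2], (0, 0), (2, 2)))  # 4.0, vs 22.0 for either alone
```

The example illustrates the synergy intuition: switching metrics mid-path yields a far shorter route than any single metric's geodesic.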

We can now articulate a geometric version of the principle of cognitive synergy. Basically: cognitive synergy occurs when the synergetic metric yields significantly shorter distances between relevant states and goals than any of the memory-type-specific metrics. Formally, one may say that:

Definition B.1. An intelligent agent $A$ (modeled by SRAM) displays cognitive synergy to the extent

$$syn(A) \equiv \int\!\!\int \left( \min\left(d_{Proc}(x,z),\, d_{Dec}(x,z),\, d_{Ep}(x,z),\, d_{Att}(x,z)\right) - d_{syn}(x,z) \right) d\mu(x)\, d\mu(z)$$

where $\mu$ measures the relevance of a state to the system's goal-achieving activity. (Since the min-combination metric never exceeds any of its component metrics, the integrand is non-negative; $syn(A)$ is large exactly when the composite mindspace offers much shorter paths than any single-memory-type mindspace.)
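Given samplers and metrics, $syn(A)$ can be estimated by Monte Carlo integration over goal-relevant state pairs; a minimal sketch under those assumptions (the sampler and all metrics are supplied by the caller):

```python
# Sketch: Monte Carlo estimate of the cognitive synergy integral, sampling
# state pairs (x, z) from the goal-relevance measure mu.
def estimate_synergy(sample_mu, component_metrics, d_syn, n=1000):
    total = 0.0
    for _ in range(n):
        x, z = sample_mu(), sample_mu()
        best_single = min(d(x, z) for d in component_metrics)
        total += best_single - d_syn(x, z)   # >= 0 for the min-combination metric
    return total / n
```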

B.6 Next Steps in Refining These Ideas

These ideas may be developed in both practical and theoretical directions. On the practical side, we have already had an interesting preliminary success, described briefly in ??, where we show that (in some small examples at any rate) replacing CogPrime's traditional algorithm for attentional learning with an explicitly information-geometric algorithm leads to dramatic increases in the intelligence of the attentional component. This work needs to be validated via implementation of a scalable version of the information geometry algorithm in question, and empirical work also needs to be done to validate the (qualitatively fairly clear) syntax-semantics correlation in this case. But tentatively, this seems to be an early example of improvement to an AGI system resulting from modifying its design to more explicitly exploit the mind-geometric principles outlined here.

Potentially, each of the inter-cognitive-process synergies implicit in the CogPrime design may be formalized in the geometric terms outlined here, and doing so is part of our research program going forward.

More generally, on the theoretical side, a mass of open questions looms. The geometry of spaces defined by the min-combination metric is not yet well-understood, and neither is the Fisher-Rao metric over nondimensional spaces or the algorithmic metric (especially in the case of generalized complexity criteria). Also, the interpretation of various classes of learning algorithms in terms of cognitive geometrodynamics is a subtle matter, and may prove especially fruitful for algorithms already defined in probabilistic or information-theoretic terms.

B.7 Returning to Our Basic Claims About CogPrime

Finally, we return to the list of basic claims about CogPrime given at the end of Chapter ??, and review their connection with the ideas in this appendix. Not all of the claims there are directly related to the ideas given here, but many of them are; to wit:


6. It is most effective to teach an AGI system aimed at roughly human-like general intelligence via a mix of spontaneous learning and explicit instruction, and to instruct it via a combination of imitation, reinforcement and correction, and a combination of linguistic and nonlinguistic instruction

• Mindspace interpretation. Different sorts of learning are primarily focused on different types of memory, and hence on different mindspaces. The effectiveness of learning focused on a particular memory type depends on multiple factors including: the general competence of the agent's learning process corresponding to that memory store, the amount of knowledge already built up in that memory store, and the degree of syntax-semantics correlation corresponding to that memory store. In terms of geometrodynamics, learning in a manner focused on a certain memory type has significant impact in terms of reshaping the mindspace implied by that memory store.

7. One effective approach to teaching an AGI system human language is to supply it with some in-built linguistic facility, in the form of rule-based and statistical-linguistics-based NLP systems, and then allow it to improve and revise this facility based on experience

• Mindspace interpretation. Language learning purely in declarative space (formal grammar rules), or purely in attentional space (statistical correlations between linguistic inputs), or purely in episodic or procedural space (experiential learning), will not be nearly so effective as language learning which spans multiple memory spaces. Language learning (like many other kinds of humanly natural learning) is better modeled as cognitive-synergetic cognitive geometrodynamics, rather than as single-memory-type cognitive geometrodynamics.

8. An AGI system with adequate mechanisms for handling the key types of knowledge mentioned above, and the capability to explicitly recognize large-scale patterns in itself, should, upon sustained interaction with an appropriate environment in pursuit of appropriate goals, emerge a variety of complex structures in its internal knowledge network, including (but not limited to):

• a hierarchical network, representing both a spatiotemporal hierarchy and an approximate "default inheritance" hierarchy, cross-linked

• a heterarchical network of associativity, roughly aligned with the hierarchical network

• a self network which is an approximate micro image of the whole network

• inter-reflecting networks modeling self and others, reflecting a "mirrorhouse" design pattern

What does this mean geometrically?

• Mindspace interpretation. The self network and mirrorhouse networks imply a roughly fractal structure for mindspace, especially when considered across multiple memory types (since the self network spans multiple memory types). Peripherally, it's interesting that the physical universe has a very roughly fractal structure too, e.g. with solar systems within galaxies within galactic clusters; so doing geometrodynamics in roughly fractal curved spaces is not a new idea.

9. Given the strengths and weaknesses of current and near-future digital computers,

a. A (loosely) neural-symbolic network is a good representation for directly storing many kinds of memory, and interfacing between those that it doesn't store directly


• Mindspace interpretation. The "neural" aspect stores associative knowledge, and the "symbolic" aspect stores declarative knowledge; and the superposition of the two in a single network makes it convenient to implement cognitive processes embodying cognitive synergy between the two types of knowledge.

b. Uncertain logic is a good way to handle declarative knowledge

• Mindspace interpretation. There are many senses in which uncertain logic is "good" for AGI; but the core points are that:

– it makes representation of real-world relationships relatively compact

– it makes inference chains of real-world utility relatively short

– it gives high syntax-semantics correlation for logical relationships involving uncertainty (because it lends itself to syntactic distance measures that treat uncertainty naturally, gauging distance between two logical relationships based partly on the distances between the corresponding uncertainty values; e.g. the PLN metric defined in terms of SimilarityLink truth values)

– because the statistical formulas for truth value calculation are related to statistical formulas for association-finding, it makes synergy between declarative and associative knowledge relatively straightforward.

c. Programs are a good way to represent procedures (both cognitive and physical-action, but perhaps not including low-level motor-control procedures)

d. Evolutionary program learning is a good way to handle difficult program learning problems

• Probabilistic learning on normalized programs is one effective approach to evolutionary program learning

• MOSES is one good realization

– Mindspace interpretation. Program normalization creates relatively high syntax-semantics correlation in procedural knowledge (program) space, and MOSES is an algorithm that systematically exploits this knowledge.

e. Multistart hill-climbing on normalized programs, with a strong Occam prior, is a good way to handle relatively straightforward program learning problems

f. Activation spreading is a reasonable way to handle attentional knowledge (though other approaches, with greater overhead cost, may provide better accuracy and may be appropriate in some situations)

• Articial economics is an effective approach to activation spreading in the contextof neural-symbolic network.

• ECAN is one good realization, with Hebbian learning as one route of learning asso-ciative relationships, and more sophisticated methods such as information-geometricones potentially also playing a role

• A good trade-off between comprehensiveness and efficiency is to focus on two kinds of attention: processor attention (represented in CogPrime by ShortTermImportance) and memory attention (represented in CogPrime by LongTermImportance)

The mindspace interpretation includes the observations that

• Artificial economics provides more convenient conversion between attentional and declarative knowledge, compared to more biologically realistic neural net type models of attentional knowledge

• In one approach to structuring the attentional mindspace, historical knowledge regarding what was worth attending (i.e. high-strength HebbianLinks between Atoms that were in the AttentionalFocus at the same time, and linkages between these maps and system goals) serves to shape the mindspace, and learning the other HebbianLinks in the network may be viewed as an attempt to follow short paths through attentional mindspace (as explicitly shown in Chapter ??).

g. Simulation is a good way to handle episodic knowledge (remembered and imagined)

• Running an internal "world simulation engine" is an effective way to handle simulation

What's the mindspace interpretation? For example,

• The world simulation engine takes a certain set of cues and scattered memories related to an episode, and creatively fills in the gaps to create a full-fledged simulation of the episode. Syntax-semantics correlation means that stating "sets of cues and scattered memories" A and B are similar, is approximately the same as stating that the corresponding full-fledged simulations are similar.

• Many dreams seem to be examples of following paths through episode space, from one episode to another semantically related one, etc. But these paths are often aimless, though generally following semantic similarity. Trying to think of or remember an episode matching certain constraints is a process where following short paths through episodic mindspace is relevant.

h. One effective way to handle goals is to represent them declaratively, and allocate attention among them economically

• CogPrime's PLN/ECAN based framework for handling intentional knowledge is one good realization

One aspect of the mindspace interpretation is that using PLN and ECAN together to represent goals aids with the cognitive synergy between declarative, associative and intentional space. Achieving a goal is then (among other things) about finding short paths to the goal through declarations, associations and actions.

10. It is important for an intelligent system to have some way of recognizing large-scale patterns in itself, and then embodying these patterns as new, localized knowledge items in its memory

• Given the use of a neural-symbolic network for knowledge representation, a graph-mining based "map formation" heuristic is one good way to do this

Key aspects of the mindspace interpretation are that:

• via map formation, associative (map) and declarative, procedural or episodic (localized) knowledge are correlated, promoting cognitive synergy

• approximate and emergent inference on concept maps, occurring via associational processes, roughly mirrors portions of PLN reasoning on declarative concepts and relationships. This aids greatly with cognitive synergy, and in fact one can draw "natural transformations" (in the language of category theory) between map inference and localized, declarative concept inference.

11. Occam's Razor: Intelligence is closely tied to the creation of procedures that achieve goals in environments in the simplest possible way.

• Each of an AGI system's cognitive algorithms should embody a "simplicity bias" in some explicit or implicit form

Obviously, one aspect of the mindspace interpretation of this principle is simply the geometrodynamic idea of following the shortest path through mindspace, toward the appointed set of goal states. Also, this principle is built into the definition of semantic space used in the mindspace framework developed above, since computational simplicity is used to define the semantic metric between memory items.

While the abstract "mind geometry" theory presented in this appendix doesn't (yet) provide a way of deriving the CogPrime design from first principles, it does provide a useful general vocabulary for discussing the various memory types and cognitive processes in CogPrime in a unified way. And it also has some power to suggest novel algorithms to operate within cognitive processes, as in the case of our work on information geometry and ECAN. Whether mind geometry will prove a really useful ingredient in CogPrime theory or AGI theory more broadly remains to be determined; but we are cautiously optimistic and intend to pursue further in this direction.


Appendix C
Emergent Reflexive Mental Structures

Co-authored with Tony Smith, Onar Aam and Kent Palmer

C.1 Introduction

This appendix deals with some complex emergent structures we suspect may emerge in advanced CogPrime and other AGI systems. The ideas presented here are flatly conjectural, and we stress that the CogPrime design is not dependent thereupon. The more engineering-oriented reader may skip them without any near-term loss. However, we do believe that this sort of rigorous lateral thinking is an important part of any enterprise as ambitious as building a human-level AGI.

We have stated that the crux of an AGI system really lies on the emergent level – on the structures and dynamics that arise in the system as a result of its own self-organization and its coupling with other minds and the external world. We have talked a bit about some of these emergent patterns – e.g. maps and various sorts of networks – but by and large they have stayed in the background. In this appendix we will indulge in a bit of speculative thinking about some of the high-level emergent patterns that we believe may emerge in AGI systems once they begin to move toward human-level intelligence, and specifically once they acquire a reasonably sophisticated ability to model themselves and other minds.¹ These patterns go beyond the relatively well-accepted network structures reviewed in chapter ??, and constitute an edgier, more ambitious hypothesis regarding the emergent network structures of general intelligence.

More specifically, the thesis of this appendix is that there are certain abstract algebraic structures that typify the self-structure of human beings and any other intelligent systems relying on empathy for social intelligence. These structures may be modeled using various sorts of mathematics, including hypersets and also algebraic structures called quaternions and octonions (which also play a critical role in modern theoretical physics [?]). And, assuming mature, reasonably intelligent AGI systems are created, it will be possible to empirically determine whether the mathematical structures posited here do or do not emerge in them.

¹ A note from Ben Goertzel. In connection with the material in this appendix, I would like to warmly acknowledge Louis Kauffman for an act of kindness that occurred back in 1986, when I was a 19 year old PhD student, when he mailed me a copy of his manuscript Sign and Space, which contained so many wonderful ideas and drawings related to the themes considered here. Lou's manuscript wasn't my first introduction to the meme of consciousness and self-reference – I got into these ideas first via reading Douglas Hofstadter at age 13 in 1979, and then later via reading G. Spencer-Brown. But my brief written correspondence with Lou (this was before email was common even in universities) and his lovely hand-written and -drawn manuscript solidified my passion for these sorts of ideas, and increased my confidence that they are not only fascinating but deeply meaningful.


C.2 Hypersets and Patterns

The first set of hypotheses we will pursue in this appendix is that the abstract structures corresponding to free will, reflective consciousness and phenomenal self are effectively modeled using the mathematics of hypersets.

What are these things called hypersets, which we posit as cognitive models?

In the standard axiomatizations of set theory, such as Zermelo-Fraenkel set theory [?], there is an axiom called the Axiom of Foundation, which implies that no set can contain itself as a member. That is, it implies that all sets are "well founded" – they are built up from other sets, which in turn are built up from other sets, etc., ultimately being built up from the empty set or from atomic elements. The hierarchy via which sets are built from other sets may be infinite (according to the usual Axiom of Infinity), but it goes in only one direction – if set A is built from set B (or from some other set built from set B), then set B can't be built from set A (or from any other set built from set A).

However, since very shortly after the Axiom of Foundation was formulated, there have been various alternative axiomatizations which allow "non-well-founded" sets (aka hypersets), i.e. sets that can contain themselves as members, or have more complex circular membership structures. Hyperset theory is generally formulated as an extension of classical set theory rather than a replacement – i.e., the well-founded sets within a hyperset domain conform to classical set theory. In recent decades the theory of non-well-founded sets has been applied in computer science (e.g. process algebra [?]), linguistics and natural language semantics (situation theory [?]), philosophy (work on the Liar Paradox [?]), and other areas.

For instance, in hyperset theory you can have

A = {A}
A = {B, {A}}

and so forth. Using hypersets you can have functions that take themselves as arguments, and many other interesting phenomena that aren't permitted by the standard axioms of set theory.

The main work of this appendix is to suggest specific models of free will, reflective consciousness and phenomenal self in terms of hyperset mathematics.

The reason the Axiom of Foundation was originally introduced was to avoid paradoxes like the Russell Set (the set of all sets that do not contain themselves). None of these variant set theories allow all possible circular membership structures; but they allow restricted sets of such, sculpted to avoid problems like the Russell Paradox.

One currently popular form of hyperset theory is obtained by replacing the Axiom of Foundation with the Anti-Foundation Axiom (AFA) which, roughly speaking, permits circular membership structures that map onto graphs in a certain way. All the hypersets discussed here are easily observed to be allowable under the AFA (according to the Solution Lemma stated in [?]).

Specifically, the AFA uses the notion of an accessible pointed graph – a directed graph with a distinguished element (the "root") such that for any node in the graph there is at least one path in the directed graph from the root to that node. The AFA states that every accessible pointed graph corresponds to a unique set. For example, the graph consisting of a single vertex with a loop corresponds to a set which contains only itself as an element.
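As a concrete illustration, here is a small Python sketch (a toy, not part of the formal theory) that represents apgs as adjacency dicts and tests whether two of them picture the same hyperset, using the standard criterion that their roots be bisimilar.

  def bisimilar(g1, r1, g2, r2):
      # g1, g2: dicts mapping each node to its set of children (membership edges).
      # Start from all node pairs and prune down to the greatest bisimulation.
      pairs = {(a, b) for a in g1 for b in g2}
      changed = True
      while changed:
          changed = False
          for a, b in list(pairs):
              ok = (all(any((x, y) in pairs for y in g2[b]) for x in g1[a])
                    and all(any((x, y) in pairs for x in g1[a]) for y in g2[b]))
              if not ok:
                  pairs.discard((a, b))
                  changed = True
      return (r1, r2) in pairs

  omega = {'O': {'O'}}                  # one vertex with a loop: X = {X}
  cycle = {'A': {'B'}, 'B': {'A'}}      # A = {B}, B = {A}
  print(bisimilar(omega, 'O', cycle, 'A'))   # True: both picture the set X = {X}

Both graphs decorate to the same non-well-founded set, which is why a bisimulation test, rather than structural equality of the graphs, is the right notion of hyperset equality here.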

While the specific ideas presented here are novel, the idea of analyzing consciousness and related structures in terms of infinite recursions and non-foundational structures has occurred before, for instance in the works of Douglas Hofstadter [?], G. Spencer-Brown [?], Louis Kauffman [?] and Francisco Varela [?]. None of these works uses hypersets in particular; but a more important difference is that none of them attempts to deal with particular psychological phenomena in terms of correlation, causation, pattern theory or similar concepts; they essentially stop at the point of noting the presence of a formalizable pattern of infinite recursion in reflective consciousness. [?] does venture into practical psychology via porting some of R.D. Laing's psychosocial "knots" [?] into a formal non-foundational language; but this is a very specialized exercise that doesn't involve modeling general psychological structures or processes. Situation semantics [?] does analyze various commonsense concepts and relationships using hypersets; however, it doesn't address issues of subjective experience explicitly, and doesn't present formal treatments of the phenomena considered here.

C.2.1 Hypersets as Patterns in Physical or Computational Systems

Hypersets are large infinite sets – they are certainly not computable – and so one might wonder if a hyperset model of consciousness supports Penrose [?] and Hameroff's [?] notion of consciousness as involving as-yet unknown physical dynamics involving uncomputable mathematics. However, this is not our perspective.

In the following we will present a number of particular hypersets and discuss their presence as patterns in intelligent systems. But this does not imply that we are positing intelligent systems to fundamentally be hypersets, in the sense that, for instance, classical physics posits intelligent systems to be matter in 3 + 1 dimensional space. Rather, we are positing that it is possible for hypersets to serve as patterns in physical systems, where the latter may be described in terms of classical or modern physics, or in terms of computation.

How is this possible? If a hyperset can produce a somewhat accurate model of a physical system, and is judged simpler than a detailed description of the physical system, then it may be a pattern in that system according to the definition of pattern given above.

Recall the definition of pattern given in chapter ??:

Definition 1. Given a metric space (M, d), and two functions c : M → [0, ∞] (the "simplicity measure") and F : M → M (the "production relationship"), we say that P ∈ M is a pattern in X ∈ M to the degree

ι^P_X = [(1 − d(F(P), X)/c(X)) · (c(X) − c(P))/c(X)]^+

This degree is called the pattern intensity of P in X.
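A throwaway numerical illustration of this definition, with hypothetical stand-ins for d, c and F (strings as entities, repetition as production, length as simplicity), might look as follows in Python; note the positive-part clip corresponding to the [...]^+ in the formula.

  def pattern_intensity(P, X, d, c, F):
      fidelity = 1.0 - d(F(P), X) / c(X)        # how well F(P) reproduces X
      compression = (c(X) - c(P)) / c(X)        # how much simpler P is than X
      return max(0.0, fidelity * compression)   # [...]^+ clips negative values

  X = "abcabcabcabc"
  P = ("abc", 4)                                # a (motif, repeat-count) "program"
  iota = pattern_intensity(
      P, X,
      d=lambda y, x: float(sum(a != b for a, b in zip(y, x)) + abs(len(y) - len(x))),
      c=lambda z: float(len(z) if isinstance(z, str) else len(z[0]) + 2),
      F=lambda p: p[0] * p[1])
  # F(P) reproduces X exactly, and c(P) = 5 < c(X) = 12, so iota is positive:
  # P counts as a pattern in X.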

To use this definition to bridge the gap between hypersets and ordinary computer programs and physical systems, we may define the metric space M to contain both hypersets and computer programs, and also tuples whose elements may be freely drawn from either of these classes. Define the partial order < so that if X is an entry in a tuple T, then X < T.

Distance between two programs may be defined using the algorithmic information metric

d_I(A, B) = I(A|B) + I(B|A)


where I(A|B) is the length of the shortest self-delimiting program for computing A given B [?]. Distance between two hypersets A and B may be defined as

d_H(A, B) = d_I(g(A), g(B))

where g(A) is the graph (A's apg, in AFA lingo) picturing A's membership relationship. If A is a program and X is a hyperset, we may set d(A, X) = ∞.
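The metric d_I is uncomputable, but it is commonly approximated in practice by substituting a real compressor, as in Cilibrasi and Vitanyi's normalized compression distance; a minimal sketch (zlib being just one convenient choice of compressor):

  import zlib

  def C(s: bytes) -> int:
      return len(zlib.compress(s, 9))    # compressed length as a proxy for I(.)

  def ncd(a: bytes, b: bytes) -> float:
      # Normalized compression distance: a computable stand-in for the
      # (normalized) algorithmic information metric.
      return (C(a + b) - min(C(a), C(b))) / max(C(a), C(b))

The distance d_H between two hypersets could then be approximated by applying ncd to canonical serializations of their apgs g(A) and g(B).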

Next, the production relation F may be defined to act on a (hyperset, program) pair P = (X, A) via feeding the graph representing X (in some standard encoding) to A as an input. According to this production relation, P may be a pattern in the bit string B = A(g(X)); and since X < P, the hyperset X may be a subpattern in the bit string B.

It follows from the above that a hyperset can be part of the mind of a finite system described by a bit string, a computer program, or some other finite representation. But what sense does this make conceptually? Suppose that a finite system S contains entities of the form

C
G(C)
G(G(C))
G(G(G(C)))
...

Then it may be effective to compute S using a (hyperset, program) pair containing the hyperset

X = G(X)

and a program that calculates the first k iterates of the hyperset. If so, then the hyperset {X = G(X)} may be a subpattern in S. We will see some concrete examples of this in the following.
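For instance (a toy sketch, with G and C hypothetical), a few lines of Python regenerate the whole series from the compact data (C, G, k); when the expanded series occupies a large part of S, this compactness is exactly what lets the fixed-point description X = G(X), paired with the iteration program, qualify as a (sub)pattern in S.

  def iterates(G, C, k):
      out, x = [], C
      for _ in range(k):
          out.append(x)     # C, G(C), G(G(C)), ...
          x = G(x)
      return out

  # e.g. G wraps its argument in one more level of set-braces:
  series = iterates(lambda s: frozenset([s]), frozenset(), 4)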

Whether one thing is a pattern in another depends not only on production but also on relative simplicity. So, if a system is studied by an observer who is able to judge some hypersets as simpler than some computational entities, then there is the possibility for hypersets to be subpatterns in computational entities, according to that observer. For such an observer, there is the possibility to model mental phenomena like will, self and reflective consciousness as hypersets, consistently with the conceptualization of mind as pattern.

C.3 A Hyperset Model of Reflective Consciousness

Now we proceed to use hypersets to model the aspect of mind we call "reflective consciousness." Whatever your view of the ultimate nature of consciousness, you probably agree that different entities in the universe manifest different kinds of consciousness or "awareness." Worms are aware in a different way than rocks; and dogs, pigs, pigeons and people are aware in a different way from worms. In [?] it is argued that hypersets can be used to model the sense in which the latter beasts are conscious whereas worms are not – i.e. what might be called "reflective consciousness."

We begin with the old cliché that

Consciousness is consciousness of consciousness

Note that this is nicely approximated by the series

A
Consciousness of A
Consciousness of consciousness of A
...

This is conceptually elegant, but doesn't really serve as a definition or precise characterization of consciousness. Even if one replaces it with

Reflective consciousness is reflective consciousness of reflective consciousness

it still isn't really adequate as a model of most reflectively conscious experience – although it does seem to capture something meaningful.

In hyperset theory, one can write an equation

f = f(f)

with complete mathematical consistency. You feed f the input f ... and you receive the output f.

But while this sort of anti-foundational recursion may be closely associated with consciousness, this simple equation itself doesn't tell you much about consciousness. We don't really want to say

ReflectiveConsciousness = ReflectiveConsciousness(ReflectiveConsciousness)

It’s more useful to say:

Reflective consciousness is a hyperset, and reflective consciousness is contained in its membership scope

Here by the "membership scope" of a hyperset S, what we mean is the members of S, plus the members of the members of S, etc. However, this is no longer a definition of reflective consciousness, merely a characterization. What it says is that reflective consciousness must be defined anti-foundationally as some sort of construct via which reflective consciousness builds reflective consciousness from reflective consciousness – but it doesn't specify exactly how.


Putting this notion together with the discussion from Chapter ?? on patterns, correlations and experience, we arrive at the following working definition of reflective consciousness. Assume the existence of some formal language with enough power to represent nested logical predicates, e.g. standard predicate calculus will suffice; let us refer to expressions in this language as "declarative content." Then we may say

Definition C.1. "S is reflectively conscious of X" is defined as:
The declarative content that {"S is reflectively conscious of X" correlates with "X is a pattern in S"}

For example: Being reflectively conscious of a tree means having in one's mind declarative knowledge of the form that one's reflective consciousness of that tree is correlated with that tree being a pattern in one's overall mind-state. Figure C.1 graphically depicts the above definition.

Fig. C.1: Graphical depiction of "Ben is reflectively conscious of his inner image of a money tree"

Note that this declarative knowledge doesn't have to be explicitly represented in the experiencer's mind as a well-formalized language – just as pigeons, for instance, can carry out deductive reasoning without having a formalization of the rules of Boolean or probabilistic logic in their brains. All that is required is that the conscious mind has an internal "informal, possibly implicit" language capable of expressing and manipulating simple hypersets. Boolean logic is still a subpattern in the pigeon's brain even though the pigeon never explicitly applies a Boolean logic rule, and similarly the hypersets of reflective consciousness may be subpatterns in the pigeon's brain in spite of its inability to explicitly represent the underlying mathematics.

Turning next to the question of how these hyperset constructs may emerge from finite systems, Figures C.2, C.3 and C.4 show the first few iterates of a series of structures that would naturally be computed by a pattern containing as a subpattern Ben's reflective consciousness of his inner image of a money tree. The presence of a number of iterates in this sort of series, as patterns or subpatterns in Ben, will lead to the presence of the hyperset of "Ben's reflective consciousness of his inner image of a money tree" as a subpattern in his mind.

Fig. C.2: First iterate of a series that converges to Ben's reflective consciousness of his inner image of a money tree

C.4 A Hyperset Model of Will

The same approach can be used to define the notion of "will," by which is meant the sort of willing process that we carry out in our minds when we subjectively feel like we are deciding to make one choice rather than another [?]. In brief:

Definition C.2. "S wills X" is defined as:
The declarative content that {"S wills X" causally implies "S does X"}


Fig. C.3: Second iterate of a series that converges to Ben's reflective consciousness of his inner image of a money tree

Fig. C.4: Third iterate of a series that converges to Ben's reflective consciousness of his inner image of a money tree

Figure C.5 graphically depicts the above definition.

Fig. C.5: Graphical depiction of "Ben wills himself to kick the soccer ball"

To fully explicate this is slightly more complicated than in the case of reflective consciousness, due to the need to unravel what's meant by "causal implication." For sake of the present discussion we will adopt the view of causation presented in [?], according to which causal implication may be defined as: Predictive implication combined with the existence of a plausible causal mechanism.

More precisely, if A and B are two classes of events, then A "predictively implies B" if it's probabilistically true that in a situation where A occurs, B often occurs afterwards. (Of course, this is dependent on a model of what is a "situation", which is assumed to be part of the mind assessing the predictive implication.)

And, a "plausible causal mechanism" associated with the assertion "A predictively impliesB" means that, if one removed from one’s knowledge base all specic instances of situationsproviding direct evidence for "A predictively implies B", then the inferred evidence for "Apredictively implies B" would still be reasonably strong. (In PLN lingo, this means there isstrong intensional evidence for the predictive implication, along with extensional evidence.)

If X and Y are particular events, then the probability of "X causally implies Y" may be assessed by probabilistic inference based on the classes (A, B, etc.) of events that X and Y belong to.
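An illustrative sketch of the first ingredient, estimating a predictive implication from a temporally ordered event log (the log format, the horizon and all names here are hypothetical):

  def predictive_implication(log, A, B, horizon=3):
      """Estimate P(an event of class B occurs within `horizon` steps | A occurs),
      where `log` is a list of event-class labels in temporal order."""
      hits = trials = 0
      for t, e in enumerate(log):
          if e == A:
              trials += 1
              hits += B in log[t + 1 : t + 1 + horizon]
      return hits / trials if trials else 0.0

The plausible-mechanism test would then re-estimate the implication with the direct A-to-B evidence removed, asking whether inference through the classes that A and B belong to (intensional evidence, in PLN terms) still supports it; that step requires a background knowledge base and is only gestured at here.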

C.4.1 In What Sense Is Will Free?

Briefly, what does this say about the philosophical issues traditionally associated with the notion of "free will"?


It doesn't suggest any validity for the idea that will somehow adds a magical ingredient beyond the familiar ingredients of "rules" plus "randomness." In that sense, it's not a very radical approach. It fits in with the modern understanding that free will is to a certain extent an "illusion", and that some sort of "natural autonomy" [?] is a more realistic notion.

However, it also suggests that "illusion" is not quite the right word. An act of will may have causal implication, according to the psychological definition of the latter, without this action of will violating the notion of deterministic/stochastic equations of the universe. The key point is that causality is itself a psychological notion (where within "psychological" I include cultural as well as individual psychology). Causality is not a physical notion; there is no branch of science that contains the notion of causation within its formal language. In the internal language of mind, acts of will have causal impacts – and this is consistent with the hypothesis that mental actions may potentially be ultimately determined via deterministic/stochastic lower-level dynamics. Acts of will exist on a different level of description than these lower-level dynamics. The lower-level dynamics are part of a theory that compactly explains the behavior of cells, molecules and particles; and some aspects of complex higher-level systems like brains, bodies and societies. Will is part of a theory that compactly explains the decisions of a mind to itself.

C.4.2 Connecting Will and Consciousness

Connecting back to reflective consciousness, we may say that:

In the domain of reflective conscious experiences, acts of will are experienced as causal.

This may seem a perfectly obvious assertion. What's nice is that, in the present perspective, it seems to fall out of a precise, abstract characterization of consciousness and will.

C.5 A Hyperset Model of Self

Finally, we posit a similar characterization for the cognitive structure called the "phenomenal self" – i.e. the psychosocial model that an organism builds of itself, to guide its interaction with the world and also its own internal choices. For a masterfully thorough treatment of this entity, see Thomas Metzinger's book Being No One [?].

One way to conceptualize self is in terms of the various forms of memory comprising a humanlike intelligence [?], which include procedural, semantic and episodic memory.

In terms of procedural memory, an organism's phenomenal self may be viewed as a predictive model of the system's behavior. It need not be a wholly accurate predictive model; indeed many human selves are wildly inaccurate, and aesthetically speaking, this can be part of their charm. But it is a predictive model that the system uses to predict its behavior.

In terms of declarative memory, a phenomenal self is used for explanation – it is an explanatory model of the organism's behaviors. It allows the organism to carry out (more or less uncertain) inferences about what it has done and is likely to do.


In terms of episodic memory, a phenomenal self is used as the protagonist of the organism's remembered and constructed narratives. It's a fictional character, "based on a true story," simplified and sculpted to allow the organism to tell itself and others (more or less) sensible stories about what it does.

The simplest version of a hyperset model of self would be:

Definition C.3. "X is part of S's phenomenal self" is defined as the declarative content that {"X is a part of S's phenomenal self" correlates with "X is a persistent pattern in S over time"}

Fig. C.6: Graphical depiction of "Ben's representation-of/adaptation to his parrot is a part of his phenomenal self" (Image of parrot is from a painting by Scheherazade Goertzel)

Figure C.6 graphically depicts the above definition.

A subtler version of the definition would take into account the multiplicity of memory types:

Definition C.4. "X is part of S's phenomenal self" is defined as the declarative content that {"X is a part of S's phenomenal self" correlates with "X is a persistent pattern in S's declarative, procedural and episodic memory over time"}

One thing that's nice about this definition (in both versions) is the relationship that it implies between self and reflective consciousness. In a formula:

Self is to long-term memory as reflective consciousness is to short-term memory

According to these definitions:


• A mind's self is nothing more or less than its reflective consciousness of its persistent being.

• A mind's reflective consciousness is nothing more or less than the self of its short-term being.

C.6 Validating Hyperset Models of Experience

We have made some rather bold hypotheses here, regarding the abstract structures present in physical systems corresponding to the experiences of reflective consciousness, free will and phenomenal self. How might these hypotheses be validated or refuted?

The key is the evaluation of hypersets as subpatterns in physical systems. Taking reflective consciousness as an example, one could potentially validate whether, when a person is (or, in the materialist view, reports being) reflectively conscious of a certain apple being in front of them, the hypothetically corresponding hyperset structure is actually a subpattern in their brain structure and dynamics. We cannot carry out this kind of data analysis on brains yet, but it seems within the scope of physical science to do so.

But, suppose the hypotheses presented here are validated, in the sense proposed above. Will this mean that the phenomena under discussion – free will, reflective consciousness, phenomenal self – have been "understood"?

This depends on one's philosophy of consciousness. According to a panpsychist view, for instance, the answer would seem to be "yes," at least in a broad sense – the hyperset models presented would then constitute a demonstrably accurate model of the patterns in physical systems corresponding to the particular manifestations of universal experience under discussion. And it also seems that the answer would be "yes" according to a purely materialist perspective, since in that case we would have figured out what classes of physical conditions correspond to the "experiential reports" under discussion. Of course, both the panpsychist and materialist views are ones in which the "hard problem" is not an easy problem but rather a non-problem!

The ideas presented here have originated within a patternist perspective, in which what's important is to identify the patterns constituting a given phenomenon; and so we have sought to identify the patterns corresponding to free will, reflective consciousness and phenomenal self. The "hard problem" then has to do with the relationships between various qualities that these patterns are hypothesized to possess (experiential versus physical) ... but from the point of view of studying brains, building AI systems or conducting our everyday lives, it is generally the patterns (and their subpatterns) that matter.

Finally, if the ideas presented above are accepted as a reasonable approach, there is certainly much more work to be done. There are many different states of consciousness, many different varieties of self, many different aspects to the experience of willing, and so forth. These different particulars may be modeled using hypersets, via extending and specializing the definitions proposed above. This suggested research program constitutes a novel variety of consciousness studies, using hypersets as a modeling language, which may be guided from a variety of directions including empirics and introspection.


C.7 Implications for Practical Work on Machine Consciousness

But what are the implications of the above ideas for machine consciousness in particular? One very clear implication is that digital computers probably can be just as conscious as humans can. Why the hedge "probably"? One reason is the possibility that there are some very odd, unanticipated restrictions on the patterns realizable in digital computers under the constraints of physical law. It is possible that special relativity and quantum theory, together, don't allow a digital computer to be smart enough to manifest self-reflective patterns of the complexity characteristic of human consciousness. (Special relativity means that big systems can't think as fast as small ones; quantum theory means that systems with small enough components have to be considered quantum computers rather than classical digital computers.) This seems extremely unlikely to me, but it can't be rated impossible at this point. And of course, even if it's true, it probably just means that machine consciousness needs to use quantum machines, or whatever other kind of machines the brain turns out to be.

Setting aside fairly remote possibilities, then, it seems that the patterns characterizing reflective consciousness, self and will can likely emerge from AI programs running on digital computers. But then, what more can be said about how these entities might emerge from the particular cognitive architectures and processes at play in the current AI field?

The answer to this question turns out to depend fairly sensitively on the particular AI architecture under consideration. Here we will briefly explore this issue in the context of CogPrime.

How do our hyperset models of reflective consciousness, self and will reflect themselves in the CogPrime architecture?

There is no simple answer to these questions, as CogPrime is a complex system with multiple interacting structures and dynamics, but we will give here a broad outline.

C.7.1 Attentional Focus in CogPrime

The key to understanding reflective consciousness in CogPrime is the ECAN (Economic Attention Networks) component, according to which each Atom in the system's memory has certain ShortTermImportance (STI) and LongTermImportance (LTI) values. These spread around the memory in a manner vaguely similar to activation spreading in a neural net, but using equations drawn from economics. The equations are specifically tuned so that, at any given time, a certain relatively small subset of Atoms will have significantly higher STI and LTI values than the rest. This set of important Atoms is called the AttentionalFocus, and represents the "moving bubble of attention" mentioned above, corresponding roughly to the Global Workspace in global workspace theory.
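A toy sketch of the economic flavor of these dynamics (not the actual ECAN equations; the rent and wage constants and the focus size are hypothetical):

  RENT, WAGE, FOCUS_SIZE = 1.0, 10.0, 5    # hypothetical constants

  def ecan_step(sti, used):
      """sti: dict mapping Atom -> ShortTermImportance;
      used: the Atoms some cognitive process just made use of.
      Returns the new AttentionalFocus."""
      for atom in sti:
          sti[atom] -= RENT            # every Atom pays rent to the central bank
          if atom in used:
              sti[atom] += WAGE        # Atoms that proved useful earn wages
      # The AttentionalFocus is the small set of highest-STI Atoms:
      # the "moving bubble of attention".
      return set(sorted(sti, key=sti.get, reverse=True)[:FOCUS_SIZE])

Repeatedly rewarded Atoms stay in the bubble; Atoms that stop earning wages gradually rent themselves out of focus.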

According to the patternist perspective, if some set of Atoms remains in the AttentionalFocus for a sustained period of time (which is what the ECAN equations are designed to encourage), then this Atom-set will be a persistent pattern in the system, hence a significant part of the system's mind and consciousness. Furthermore, the ECAN equations encourage the formation of densely connected networks of Atoms which are probabilistic attractors of ECAN dynamics, and which serve as hubs of larger, looser networks known as "maps." The relation between an attractor network in the AttentionalFocus and the other parts of corresponding maps that have lower STI is conceptually related to the feeling humans have that the items in their focus of reflective consciousness are connected to other dimly-perceived items "on the fringes of consciousness."

The moving bubble of attention does not in itself constitute humanlike "reflective consciousness", but it prepares the context for this. Even a simplistic, animal-like CogPrime system with almost no declarative understanding of itself or ability to model itself, may still have intensely conscious patterns, in the sense of having persistent networks of Atoms frequently occupying its AttentionalFocus, its global workspace.

C.7.2 Maps and Focused Attention in CogPrime

The relation between focused attention and distributed cognitive maps in CogPrime bears some emphasis, and is a subtle point related to CogPrime knowledge representation, which takes both explicit and implicit forms. The explicit level consists of Atoms with clearly comprehensible meanings, whereas the implicit level consists of "maps" as mentioned above – collections of Atoms that become important in a coordinated manner, analogously to cell assemblies in an attractor neural net.

Formation of small maps seems to follow from the logic of focused attention, along with hierarchical maps of a certain nature. But the argument for this is somewhat subtle, involving cognitive synergy between PLN inference and economic attention allocation.

The nature of PLN is that the effectiveness of reasoning is maximized by (among other strategies) minimizing the number of incorrect probabilistic independence assumptions. If reasoning on N nodes, the way to minimize independence assumptions is to use the full inclusion-exclusion formula to calculate interdependencies between the N nodes. This involves 2^N terms, one for each subset of the N nodes. Very rarely, in practical cases, will one have significant information about all these subsets. However, the nature of focused attention is that the system seeks to find out about as many of these subsets as possible, so as to be able to make the most accurate possible inferences, hence minimizing the use of unjustified independence assumptions. This implies that focused attention cannot hold too many items within it at one time, because if N is too big, then doing a decent sampling of the subsets of the N items is no longer realistic.
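For concreteness, the inclusion-exclusion formula in the familiar N = 3 case reads

P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) − P(A1 ∩ A2) − P(A1 ∩ A3) − P(A2 ∩ A3) + P(A1 ∩ A2 ∩ A3)

with one term for each of the 2^3 = 8 subsets of {A1, A2, A3} (the empty subset contributing only trivially). Already at N = 20 there are over a million subsets, which is why a focus that tries to track subset statistics must stay small.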

So, suppose that N items have been held within focused attention, meaning that a lot of predicates embodying combinations of N items have been constructed and evaluated and reasoned on. Then, during this extensive process of attentional focus, many of the N items will be useful in combination with each other – because of the existence of predicates joining the items. Hence, many HebbianLinks (Atoms representing statistical association relationships) will grow between the N items – causing the set of N items to form a map.

By this reasoning, focused attention in CogPrime is implicitly a map formation process – even though its immediate purpose is not map formation, but rather accurate inference (inference that minimizes independence assumptions by computing as many cross terms as is possible based on available direct and indirect evidence). Furthermore, it will encourage the formation of maps with a small number of elements in them (say, N < 10). However, these elements may themselves be ConceptNodes grouping other nodes together, perhaps grouping together nodes that are involved in maps. In this way, one may see the formation of hierarchical maps, formed of clusters of clusters of clusters..., where each cluster has N < 10 elements in it.

It is tempting to postulate that any intelligent system must display similar properties – so that focused consciousness, in general, has a strictly limited scope and causes the formation of maps that have central cores of roughly the same size as its scope. If this is indeed a general principle, it is an important one, because it tells you something about the general structure of concept networks associated with intelligent systems, based on the computational resource constraints of the systems. Furthermore this ties in with the architecture of the self.

C.7.3 Reflective Consciousness, Self and Will in CogPrime

So far we have observed the formation of simple maps in OpenCogPrime systems, but we haven't yet observed the emergence of the most important map: the self-map. According to the theory underlying CogPrime, however, we believe this will ensue once an OpenCogPrime-controlled virtual agent is provided with sufficiently rich experience, including diverse interactions with other agents.

The self-map is simply the network of Nodes and Links that a CogPrime system uses to predict, explain and simulate its own behavior. "Reflection" in the sense of cognitively reflecting on oneself is modeled in CogPrime essentially as "doing PLN inference, together with other cognitive operations, in a manner heavily involving one's self-map."

The hyperset models of reflective consciousness and self presented above appear in the context of CogPrime as approximative models of properties of maps that emerge in the system due to ECAN AttentionalFocus/map dynamics and its relationship with other cognitive processes such as inference. Our hypothesis is that, once a CogPrime system is exposed to the right sort of experience, it will internally evolve maps associated with reflective cognition and self, which possess an internal recursive structure that is effectively approximated using the hyperset models given above.

Will, then, emerges in CogPrime in part due to logical Atoms known as CausalImplicationLinks. A link of this sort is formed between A and B if the system finds it useful to hypothesize that "A causes B." If A is an action that the system itself can take (a GroundedSchemaNode, in CogPrime lingo) then this means roughly that "If I chose to do A, then B would be likely to ensue." If A is not an action the system can take, then the meaning may be interpreted similarly via abductive inference (i.e. via heuristic reasoning such as "If I could do A, and I did it, then B would likely ensue").

The self-map is a distributed network phenomenon in CogPrime's AtomSpace, but the cognitive process called MapFormation may cause specific ConceptNodes to emerge that serve as hubs for this distributed network. These Self Nodes may then get CausalImplicationLinks pointing out from them – and in a mature CogPrime system, we hypothesize, these will correlate with the system's feeling of willing. The recursive structure of will emerges directly from the recursive structure of self, in this case – if the system ascribes cause to its self, then within itself there is also a model of its ascription of cause to its self (so that the causal ascription becomes part of the self that is being ascribed causal power), and so forth on multiple levels. Thus one has a finite-depth recursion that is approximatively modeled by the hyperset model of will described above.

All this goes well beyond what we have observed in the current CogPrime system (we have done some causal inference, but not yet in conjunction with self-modeling), but it follows from the CogPrime design on a theoretical level, and we will be working over the next years to bring these abstract notions into practice.


C.7.4 Encouraging the Recognition of Self-Referential Structures in the AtomSpace

Finally, we consider the possibility that a CogPrime system might explicitly model its own self and behavior using hypersets.

This is quite an interesting possibility, because, according to the same logic as map formation: if these hyperset structures are explicitly recognized when they exist, they can then be reasoned on and otherwise further refined, which may then cause them to exist more definitively ... and hence to be explicitly recognized as yet more prominent patterns ... etc. The same virtuous cycle via which ongoing map recognition and encapsulation leads to concept formation might potentially also be made to occur on the level of complex self-referential structures, leading to their refinement, development and ongoing complexity.

One relatively simple way to achieve this in CogPrime would be to encode hyperset structures and operators in the set of primitives of the "Combo" language that CogPrime uses to represent procedural knowledge (a simple LISP-like language with carefully crafted hooks into the AtomSpace and some other special properties). If this were done, one could then recognize self-referential patterns in the AtomTable via standard CogPrime methods like MOSES and PLN.

This is quite possible, but it brings up a number of other deep issues that go beyond the scope of this appendix. For instance, most knowledge in CogPrime is uncertain, so if one is to use hypersets in Combo, one would like to be able to use them probabilistically. The most natural way to assign truth values to hyperset structures is to use infinite-order probability distributions, as described in [?]. Infinite-order probability distributions are partially ordered, and so one can compare the extent to which two different self-referential structures apply to a given body of data (e.g. an AtomTable), via comparing the infinite-order distributions that constitute their truth values. In this way, one can recognize self-referential patterns in an AtomTable, and carry out encapsulation of self-referential maps. This sounds very abstract and complicated, but the class of infinite-order distributions defined in the above-referenced papers actually have their truth values defined by simple matrix mathematics, so there is really nothing that abstruse involved in practice.

Clearly, with this subtle, currently unimplemented aspect of the CogPrime design we are veering rather far from anything the human brain could plausibly be doing in detail. This is fine, as CogPrime is not intended as a brain emulation. But yet, some meaningful connections may be drawn to neuroscience. In Chapter ?? we have discussed how probabilistic logic might emerge from the brain, and also how the brain may embody self-referential structures like the ones considered here, via (perhaps using the hippocampus) encoding whole neural nets as inputs to other neural nets. Regarding infinite-order probabilities, the brain is effective at carrying out various dynamics equivalent to matrix manipulations, and one can mathematically reduce infinite-order probabilities to such manipulations, so that it's not completely outlandish to posit the brain could be doing something mathematically analogous. Thus, all in all, it seems at least plausible that the brain could be doing something roughly analogous to what we've described here, though the details would obviously be very different.


C.8 Algebras of the Social Self

In the remainder of this appendix we will step even further out on our philosophico-mathematical limb, and explore the possibility that the recursive structures of the self involve mutual recursion according to the pattern of the quaternionic and octonionic algebras.

The argument presented in favor of this radical hypothesis has two steps. First, it is argued that much of human psychodynamics consists of "internal dialogue" between separate internal actors – some of which may be conceived as subselves a la [?], some of which may be "virtual others" intended to explicitly mirror other humans (or potentially other entities like animals or software programs). Second, it is argued that the structure of inter-observation among multiple inter-observing actors naturally leads to quaternionic and octonionic algebras. Specifically, the structure of inter-observation among three inter-observers is quaternionic; and the structure of inter-observation among four inter-observers is octonionic. This mapping between inter-observation and abstract algebra is made particularly vivid by the realization that the quaternions model the physical situation of three mirrors facing each other in a triangle; whereas the octonions model the physical situation of four mirrors facing each other in a tetrahedron, or more complex packing structures related to tetrahedra. Using these facts, we may phrase the main thesis to be pursued in the remainder of the appendix in a simple form: The structure of the self of an empathic social intelligence is that of a quaternionic or octonionic mirrorhouse.

There is an intriguing potential tie-in with recent developments in neurobiology, which suggest that empathic modeling of other minds may be carried out in part via a "mirror neuron system" that enables a mind to experience another's actions, in a sense, "as if they were its own" [?]. There are also echoes here of Buckminster Fuller's [?] philosophy, which viewed the tetrahedron as an essential structure for internal and external reality (since the tetrahedron is closely tied with the quaternion algebra).

C.9 The Intrinsic Sociality of the Self

We begin the next step of our journey with a theme that is generally neglected within AI yet is absolutely critical to humanlike intelligence: the social nature of the individual mind. In what sense may it be said that the self of an individual human being is a "social" system?

A 2001 special issue of "Journal of Consciousness Studies" [?] provided an excellent summary of recent research and thinking on this topic. A basic theme spanning several papers in the issue was as follows:

1. The human brain contains structures specifically configured to respond to other humans' behaviors (these appear to involve "mirror neurons" and associated "mirror neuron systems," on which we will elaborate below).

2. These structures are also used internally when no other people (or other agents) are present, because the human self is founded on a process of continual interaction between "phenomenal self" and "virtual other(s)", where the virtual others are reflected by the same neural processes used to mirror actual others

3. So, the iteration between phenomenal self and actual others is highly wrapped up with the interaction between phenomenal self and virtual others


reliant on mirror neurons played a key role in the evolution of language. These authors suggest that Broca's area (associated with speech production) evolved on top of a mirror system specialized for grasping, and inherited from this mirror system a robust capacity for pattern recognition and generation, which was then used to enable imitation of vocalizations, and to encourage "parity" in which associations involving vocalizations are roughly the same for the speaker as for the hearer. According to the MSH, the evolution of language proceeded according to the following series of steps [?]:

1. S1: Grasping.
2. S2: A mirror system for grasping, shared with the common ancestor of human and monkey.
3. S3: A system for simple imitation of grasping, shared with the common ancestor of human and chimpanzee. The next 3 stages distinguish the hominid line from that of the great apes:
4. S4: A complex imitation system for grasping.
5. S5: Protosign, a manual-based communication system that involves the breakthrough from employing manual actions for praxis to using them for pantomime (not just of manual actions), and then going beyond pantomime to add conventionalized gestures that can disambiguate pantomimes.
6. S6: Protospeech, resulting from linking the mechanisms for mediating the semantics of protosign to a vocal apparatus of increasing flexibility. The hypothesis is not that S5 was completed before the inception of S6, but rather that protosign and protospeech evolved together in an expanding spiral.
7. S7: Language: the change from action-object frames to verb-argument structures to syntax and semantics.

As we will point out below, one may correlate this series of stages with a series of mirrorhouses involving an increasing number of mirrors. This leads to an elaboration of the MSH, which posits that evolutionarily, as the mirrorhouse of self and attention gained more mirrors, the capability for linguistic interaction became progressively more complex.

C.11 Quaternions and Octonions

In this section, as a preparation for our mathematical treatment of mirrorhouses and the self, we review the basics of the quaternion and octonion algebras. This is not original material, but it is repeated here because it is not well known outside the mathematics and physics community. Readers who want to learn more should follow the references.

Most readers will be aware of the real numbers and the complex numbers. The complex numbers are formed by positing an "imaginary number" i so that i*i = -1, and then looking at "complex numbers" of the form a + bi, where a and b are real numbers. What is less well known is that this approach to extending the real number system may be generalized further. The quaternions are formed by positing three imaginary numbers i, j and k with i*i = j*j = k*k = -1, and then looking at "quaternionic numbers" of the form a + bi + cj + dk. The octonions are formed similarly, by positing 7 imaginary numbers i, j, k, E, I, J, K and looking at "octonionic numbers" defined as linear combinations thereof.

Why 3 and 7? This is where the math gets interesting. The trick is that only for these dimensionalities can one define a multiplication table for the multiple imaginaries so that unique division and length measurement (norming) will work. For quaternions, the "magic multiplication table" looks like

i * j = k    j * i = -k
j * k = i    k * j = -i
k * i = j    i * k = -j

Using this multiplication table, for any two quaternionic numbers A and B, the equation

x * A = B

has a unique solution when solved for x. Quaternions are not commutative under multiplication, unlike real and complex numbers: this can be seen from the above multiplication table, in which e.g. i*j is not equal to j*i. However, quaternions are normed: one can define ||A|| for a quaternion A, in the familiar root-mean-square manner, and get a valid measure of length fulfilling the mathematical axioms for a norm.
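To make the unique-division claim tangible, here is a small Python sketch: the quaternion product implied by the table above, plus the explicit unique solution x = B * conj(A) / ||A||^2 of the equation x * A = B.

  def qmul(p, q):
      a1, b1, c1, d1 = p                      # p = a1 + b1*i + c1*j + d1*k
      a2, b2, c2, d2 = q
      return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
              a1*b2 + b1*a2 + c1*d2 - d1*c2,
              a1*c2 - b1*d2 + c1*a2 + d1*b2,
              a1*d2 + b1*c2 - c1*b2 + d1*a2)

  def qconj(q):
      return (q[0], -q[1], -q[2], -q[3])

  def solve_right(A, B):
      """The unique x with x * A = B (for nonzero A)."""
      n = sum(t * t for t in A)               # ||A||^2
      return tuple(t / n for t in qmul(B, qconj(A)))

  i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
  assert qmul(i, j) == k and qmul(j, i) == (0, 0, 0, -1)   # i*j = k, j*i = -k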

Note that you can also define an opposite multiplication for quaternions: from i*j = k you can reverse to get j*i = k, which is an opposite multiplication that still works, and basically just constitutes a relabeling of the quaternions. This is different from the complex numbers, where there is only one workable way to define multiplication.

The quaternion algebra is fairly well known due to its uses in classical physics and computer graphics; the octonion algebra, also known as Cayley's octaves, is less well known but is adeptly reviewed by John Baez [?].

The magic multiplication table for 7 imaginaries that leads to the properties of unique division and normed-ness is shown in Table C.1. Actually this is just one of 480 basically equivalent (and equally "magical") forms of the octonionic multiplication table (as opposed to the 2 varieties for quaternions, mentioned above). Note that, according to this or any of the other 479 tables, octonionic multiplication is neither commutative nor associative; but octonions do satisfy a weaker form of associativity called alternativity, which means that the subalgebra generated by any two elements is associative.

∗ |  i    j    k    E    I    J    K
i | −1    k   −j    I   −E   −K    J
j | −k   −1    i    J    K   −E   −I
k |  j   −i   −1    K   −J    I   −E
E | −I   −J   −K   −1    i    j    k
I |  E   −K    J   −i   −1   −k    j
J |  K    E   −I   −j    k   −1   −i
K | −J    I    E   −k   −j    i   −1

Table C.1: Octonion multiplication table

As it happens, the only normed division algebras over the reals are the real, complex, quaternionic and octonionic number systems. These four algebras also form the only alternative, finite-dimensional division algebras over the reals. These theorems are nontrivial to prove.
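Table C.1 can be checked mechanically. The following Python sketch (an illustration of ours; the encoding of signed basis elements as (sign, name) pairs is arbitrary) encodes the table and exhibits both non-associativity and the alternativity-style identity i∗(i∗j) = (i∗i)∗j:

    NAMES = ['1', 'i', 'j', 'k', 'E', 'I', 'J', 'K']
    IDX = {n: m for m, n in enumerate(NAMES)}

    # Rows and columns follow Table C.1; entries are strings such as '-j' or 'K'.
    TABLE = {
        'i': ['-1', 'k', '-j', 'I', '-E', '-K', 'J'],
        'j': ['-k', '-1', 'i', 'J', 'K', '-E', '-I'],
        'k': ['j', '-i', '-1', 'K', '-J', 'I', '-E'],
        'E': ['-I', '-J', '-K', '-1', 'i', 'j', 'k'],
        'I': ['E', '-K', 'J', '-i', '-1', '-k', 'j'],
        'J': ['K', 'E', '-I', '-j', 'k', '-1', '-i'],
        'K': ['-J', 'I', 'E', '-k', '-j', 'i', '-1'],
    }

    def mul(a, b):
        """Multiply signed basis octonions, represented as (sign, name) pairs."""
        (sa, xa), (sb, xb) = a, b
        sign = 1 if sa == sb else -1
        if xa == '1':
            prod = xb
        elif xb == '1':
            prod = xa
        else:
            prod = TABLE[xa][IDX[xb] - 1]   # data columns start at 'i'
        if prod.startswith('-'):
            sign, prod = -sign, prod[1:]
        return ('+' if sign == 1 else '-', prod)

    i, j, E = ('+', 'i'), ('+', 'j'), ('+', 'E')
    print(mul(mul(i, j), E), mul(i, mul(j, E)))   # ('+','K') vs ('-','K'): not associative
    print(mul(i, mul(i, j)), mul(mul(i, i), j))   # both ('-','j'): alternativity holds here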


C.12 Modeling Mirrorhouses Using Quaternions and Octonions

i = j ∗ k = −k ∗ j
j = k ∗ i = −i ∗ k
k = i ∗ j = −j ∗ i

The quaternion algebra therefore is the precise model of three facing mirrors, where we see mirror inversion as the quaternionic anti-commutation. The two versions of the quaternion multiplication table correspond to the two possible ways of arranging three mirrors into a triangular mirrorhouse.

When we move on to octonions, things get considerably subtler – though no less elegant, and no less conceptually satisfying. While there are 2 possible quaternionic mirrorhouses, there are 480 possible octonionic mirrorhouses, corresponding to the 480 possible variant octonion multiplication tables!

Recall that the octonions have 7 imaginaries i, j, k, E, I, J, K, which have 3 algebraic generators i, j, E (meaning that combining these three imaginaries can give rise to all the others). The third generator E is distinguished from the others, and we can vary it to get the 480 multiplications/mirrorhouses.

The simplest octonionic mirrorhouse is simply the tetrahedron (see Figure ??). More complex octonionic mirrorhouses correspond to tetrahedra with extra mirrors placed over their internal corners, as shown in Figure ??. This gives rise to very interesting geometric structures, which have been explored by Buckminster Fuller and also by various others throughout history.

Start with a 3-dimensional tetrahedron of 4 facing mirrors. Let the floor be the distinguished third generator E and the 3 walls be I, J, K (with a specific assignment of walls to imaginaries, of course). Then, by reflection through the E floor, the reflected I, J, K become i, j, k, and we now have all 7 imaginary octonions. This relatively simple tetrahedral mirrorhouse corresponds to one of the 480 different multiplications; the one given in the table above.

To get another, we truncate the tetrahedron. Truncation puts a mirror parallel to the floor, making a mirror roof. Then, when you look up at the mirror roof, you see the triangle roof parallel to the floor E. The triangle roof parallel to the floor E represents the octonion −E, and reflection in the roof −E gives 7 imaginary octonions with the multiplication rule in which −E is the distinguished third generator.

Looking up from the floor, you will also see 3 new triangles having a common side with the triangle roof −E, and 6 new triangles having a common vertex with the triangle roof −E.


The triangle roof + 9 triangles = 10 triangles form half of the faces (one hemisphere) of a 20-face quasi-icosahedron. The quasi-icosahedron is only qualitatively an icosahedron, and is not exact, since the internal angle of the pentagonal vertex figure of the reflected quasi-icosahedron is not 108 degrees, but is 109.47 degrees (the octahedral dihedral angle), and the vertex angle is not 72 degrees, but is 70.53 degrees (the tetrahedral dihedral angle). (To get an exact icosahedral kaleidoscope, three of the triangles of the tetrahedron should be golden isosceles triangles.)

Each of the 9 new triangles is a "reflection roof" defining another multiplication. Now, look down at the floor E to see 9 new triangles reflected from the 9 triangles adjoining the roof −E. Each of these 9 new triangles is a "reflection floor" defining another multiplication. We now have 1 + 1 + 9 + 9 = 20 of the 480 multiplications.

Just as we put a roof parallel to the floor E by truncating the top of the tetrahedral pyramid, we can put in 3 walls parallel to each of the 3 walls I, J, K by truncating the other 3 points of the tetrahedron, thus getting 3 × 20 = 60 more multiplications. That gives us 20 + 60 = 80 of the 480 multiplications.

To get the rest, recall that we fixed the walls I, J, K in a particular order with respect to the floor E. There are 3! = 6 permutations of the walls I, J, K. Taking them into account, we get all 6 × 80 = 480 multiplications.

In mathematical terms, this approach effectively fixes the 20-face quasi-icosahedron and varies the 4 faces of the EIJK tetrahedron according to the 24-element binary tetrahedral group {3,3,2} = SL(2,3) to get the 20 × 24 = 480 multiplications.
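The two counting arguments can be double-checked mechanically (a trivial sketch of ours):

    roof_and_floor = 1 + 1 + 9 + 9          # roof -E, floor E, 9 reflection roofs, 9 reflection floors
    with_walls = roof_and_floor + 3 * 20    # 3 more truncations, one per wall I, J, K: 80
    print(with_walls * 6)                   # 6 = 3! wall orderings: 480
    print(20 * 24)                          # 480 again, via the binary tetrahedral group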

Note that the truncated tetrahedron with a quasi-icosahedron at each vertex combines two types of symmetries:

1. tetrahedral, related to the square and the square root of 2, which gives open systems like: an arithmetic series overtone acoustic musical scale with common difference 1/8; the Roman Sacred Cut in architecture; and multilayer space-filling cuboctahedral crystal growth.
2. icosahedral, related to the pentagon, the Golden Mean (aka Golden Section), and Fibonacci sequences, which gives closed systems like: a harmonic pentatonic musical scale; Le Corbusier's Modulor; and single-layer icosahedral crystals.


It is interesting to observe that the binary icosahedral group is isomorphic to the binary symmetry group of the 4-simplex, which may be called the pentahedron and which David Finkelstein and Ernesto Rodriguez (1984) have called the "Quantum Pentacle." A pentahedron has 5 vertices, 10 edges, 10 areas, and 5 cells. The 10 areas of a pentahedron correspond to the 10 area faces of one hemisphere of an icosahedron.

The pentahedron projected into 3 dimensions looks like a tetrahedron divided into 4 quarter-tetrahedra (Figure ??). If you add a quarter-tetrahedron to each truncation of a truncated tetrahedron, you get a space-filling polytope (Figure ??) that can be centered on a vertex of a 3-dimensional diamond packing to form a Dirichlet domain of the 3-dimensional diamond packing (Figure ??). (A Dirichlet domain of a vertex in a packing is the set of points in the space in which the packing is embedded that are nearer to the given vertex than to any other.) The 4 most distant vertices of the Dirichlet domain polytope are vertices of the dual diamond packing in 3-dimensional space.

All in all, we conclude that:

1. In its simplest form, the octonion mirrorhouse is a tetrahedral mirrorhouse.
2. In its more general form, the octonion mirrorhouse shows a tetrahedral diamond packing network of quasi-icosahedra, or equivalently, of quasi-pentahedra.

Observation as Mirroring

Now we proceed to draw together the threads of the previous sections: mirror neurons and subselves, mirrorhouses and normed division algebras.

To map the community of actors inside an individual self into the mirrorhouse/algebraic framework of the previous section, it suffices to interpret the above


X = {Y}
Y = {X}

as

"X observes Y"
"Y observes X"

(e.g. we may have X = primary subself, Y = inner virtual other), and the above

i = {j, k}
j = {k, i}
k = {i, j}

as

"i observes {j observing k}"
"j observes {k observing i}"
"k observes {i observing j}"

Then we can define the − operation as an inverter of observer and observed, so that e.g.

{j, k} = −{k, j}

We then obtain the quaternions

i = j ∗ k = −k ∗ j
j = k ∗ i = −i ∗ k
k = i ∗ j = −j ∗ i


where multiplication is observation and negation is reversal of the order of observation. Three inter-observers = quaternions.

The next step is mathematically natural: if there are four symmetric inter-observers, one obtains the octonions, according to the logic of the above-described tetrahedral/tetrahedral-diamond-packing mirrorhouse. Octonions may also be used to model various situations involving more than four observers with particular asymmetries among the observers (the additional observers are the corner-mirrors truncating the tetrahedron).

Why not go further? Who's to say that the internal structure of a social mind isn't related to mirrorhouses obtained from more complex shapes than tetrahedra and truncated tetrahedra? This is indeed not impossible, but intuitively, we venture the hypothesis that where human psychology is concerned, the octonionic structure is complex enough. Going beyond this level one loses the normed division-algebra structure that makes the octonions a reasonably nice algebra, and one also gets into a domain of dramatically escalated combinatorial complexity.

Biologically, what this suggests is that the MSH of Rizzolatti and Arbib just scratches the surface. The system of mirror neurons in the human mind may in fact be a "mirrorhouse system," involving four different cell assemblies, each involving substantial numbers of mirror neurons, and arranged in such a manner as to recursively reflect and model one another. This is a concrete neurological hypothesis which is neither strongly suggested nor in any way refuted by available biological data: the experimental tools at our current disposal are simply not adequate to allow empirical exploration of this sort of hypothesis. The empirical investigation of cell assembly activity is possible now only in a very primitive way, using crude tools such as voltage-sensitive dyes which provide data with a very high noise level (see e.g. [?]). Fortunately, though, the accuracy of neural measurement technology is increasing at an exponential rate [?], so there is reason to believe that within a few decades hypotheses such as the presently posited "neural mirrorhouse" will reside in the domain of concretely-explorable rather than primarily-theoretical science.

And finally, we may take this conceptual vision one more natural step. The mirrorhouse inside an individual person's mind is just one small portion of the overall social mirrorworld. What we really have is a collection of interlocking mirrorhouses. If one face of the tetrahedron comprising my internal mirrorhouse at a certain moment corresponds to one of your currently active subselves, then we may view our two selves at that moment as two adjacent tetrahedra. We thus arrive at a view of a community of interacting individuals as a tiling of part of space using tetrahedra, a vision that would have pleased Buckminster Fuller very much indeed.

C.13 Specific Instances of Mental Mirrorhousing

We've voyaged fairly far out into mathematical modeling land – what does all this mean in terms of our everyday lives, or in terms of concrete AGI design?

Most examples of mental mirrorhousing, I suggest, are difficult for us to distinguish introspectively from other aspects of our inner lives. Mirroring among multiple subselves, simulations of others and so forth is so fully woven into our consciousness that we don't readily distinguish it from the rest of our inner life. Because of this, the nature of mental mirroring is most easily understood via reference to "extreme cases."

For instance, consider the following rather mundane real-life situation: Ben needs to estimate the time-duration of a software project that has been proposed for the consulting division of


his AI software company. Ben knows he typically underestimates the amount of time required for a project, but that he can usually arrive at a more accurate estimate via conversation with his colleague Cassio. But Cassio isn't available at the moment; or Ben doesn't want to bother him. So, Ben simulates an "internal Cassio," and they dialogue together, inside Ben's "mind's eye." This is a mirror facing a mirror – an internal Ben mirroring an internal Cassio.

But this process in itself may be more or less effective depending on the specifics – depending on, for example, which aspects of Ben or Cassio are simulated. So, an additional internal observing mind may be useful for, effectively, observing multiple runs of the "Ben and Cassio conversation simulator" and studying and tuning the behavior. Now we have a quaternionic mirrorhouse.

But is there a deeper inner observer watching over all this? In this case we have an octonionic, tetrahedral mirrorhouse.

The above is a particularly explicit example – but we suggest that much of everyday life experience consists of similar phenomena, where the different inter-mirroring agents are not necessarily associated with particular names or external physical agents, and thus are more difficult to tangibly discuss. As noted above, this relates closely to Rowan's analysis of human personality as consisting largely of the interactional dynamics of various never-explicitly-articulated and usually-not-fully-distinct subpersonalities.

For another sort of example, consider the act of creativity, which in [?] is modeled in terms of a "creative subself": a portion of the mind that is specifically devoted to creative activity in one or more media, and has its own life and awareness and memory apart from the primary self-structure. The creative subself may create a work, and present it to the main subself for consideration. The three of these participants – the primary subself, the creative subself and the creative work – may stand in a relationship of quaternionic mirroring. And then the meta-self who observes this threefold interaction completes the tetrahedral mirrorhouse.

Next, let us briefly consider the classic Freudian model of personality and motivation. According to Freud, much of our psychology consists of interaction between ego, superego and id. Rather than seeking to map the precise Freudian notions into the present framework, we will briefly comment on how ideas inspired by these Freudian notions might play a role in the present framework. The basic idea is that, to the extent that there are neuropsychological subsystems corresponding to Freudian ego, superego and id, these subsystems may be viewed as agents that mirror each other, and hence as a totality may be viewed as a quaternionic mirrorhouse. More specifically we may correlate

1. ego with the neuropsychological structure that Thomas Metzinger (2004) has identified as the "phenomenal self"
2. superego with the neuropsychological structure that represents the mind's learned goal system – the set of goals that the system has created
3. id with the neuropsychological structure that represents the mind's in-built goal system, which largely consists of basic biological drives

Using this interpretation, we find that a quaternionic ego/superego/id mirrorhouse may indeed play a role in human psychology and cognition. However, there is nothing in the theoretical framework being pursued here to suggest that this particular configuration of inter-observers has the foundational significance Freud ascribed to it. Rather, from the present perspective, this Freudian triarchy appears as an important configuration (but not the only one) that may arise within the mirrorhouse of focused attention.


child); and the capability for abstract formal reasoning comes later, in the "formal" stage of development. The natural hypothesis in this connection is that the child's mind during the concrete operational stage possesses only a quaternionic mirrorhouse (or at least, that only the quaternionic mirrorhouse is highly functional at this stage); and that the advent of the formal stage corresponds to the advent of the octonionic mirrorhouse.

This hypothesis has interesting biological applications, in the context of the previously hypothesized relationship between mirror neurons and mental mirroring. In this case, if the hypothesized correspondence between number-of-mirrors and developmental stages exists, then it should eventually be neurologically observable via studying the patterns of interaction of cell assemblies whose dynamics are dominated by mirror neurons, in the brains of children at different stages of cognitive development. As noted above, however, experimental neuroscience is currently nowhere near being able to validate or refute such hypotheses, so we must wait at least a couple of decades before pursuing this sort of empirical investigation.

C.15 Concluding Remarks

Overall, the path traced in this appendix has been a somewhat complex one, but the broad outline of the story is summarizable compactly.

Firstly, there may well be elegant recursive, self-referential structures underlying reflective consciousness, will and self.

And secondly, there may plausibly be elegant abstract-algebraic symmetries lurking within the social substructures of the self. The notion of "emergent structures of mind" may include emergent algebraic structures arising via the intrinsic algebra of reflective processes.

We have some even more elaborate and speculative conjectures extending the ideas given here, but will not burden the reader with them – we have gone as far as we have here, largely to indicate the sort of ideas that arise when one takes the notion of emergent mind structures seriously.

Ultimately, abstract as they are, these ideas must be pursued empirically rather than via conceptual argumentation and speculation. If the CogPrime engineering program is successful, the emergence or otherwise of the structures discussed here, and others extending them, will be discoverable via the mundane work of analyzing system logs.


Appendix D
GOLEM: Toward an AGI Meta-Architecture Enabling Both Goal Preservation and Radical Self-Improvement

D.1 Introduction

One question that looms large when thinking about the ultimate roadmap for AGI and the potential for self-modifying AGI systems is: how to create an AGI system that will maintain some meaningful variant of its initial goals even as it dramatically revises and improves itself – and as it becomes so much smarter via this ongoing improvement that in many ways it becomes incomprehensible to its creators or its initial condition. We would like to be able to design AGI systems that are massively intelligent, creatively self-improving, probably beneficial, and almost surely not destructive.

At this point, it's not terribly clear whether an advanced CogPrime system would have this desirable property or not. It's certainly not implausible that it would, since CogPrime does have a rich explicit goal system and is oriented to spend a significant percentage of its effort rationally pursuing its goals. And with its facility for reinforcement and imitation learning, CogPrime is well suited to learn ethical habits from its human teachers. But all this falls very far short of any kind of guarantee.

In this appendix we'll outline a general AGI meta-architecture called GOLEM (the Goal Oriented LEarning Meta-architecture), that can be used as a "wrapper" for more detailed AGI architectures like CogPrime, and that appears (but hasn't been formally proved) to have more clearly desirable properties in terms of long-term ethical behavior. From a CogPrime perspective, GOLEM may be viewed as a specific CogPrime configuration, which has powerful "AGI safety" properties but also demands a lot more computational resources than many other CogPrime configurations would.

To specify these notions a bit further, we may define an intelligent system as steadfast if, over a long period of time, it either continues to pursue the same goals it had at the start of the time period, or stops acting altogether. In this terminology, one way to confront the problem of creating probably-beneficial, almost surely non-destructive AGI, is to solve the two problems of:

• How to encapsulate the goal of beneficialness in an AGI's goal system
• How to create steadfast AGI, in a way that applies to the "beneficialness" goal among others


Of course, the easiest way to achieve steadfastness is to create a system that doesn't change or grow much. And the interesting question raised is how to couple steadfastness with ongoing, radical, transformative learning.

In this appendix we'll present a careful semi-formal argument that, under certain reasonable assumptions (and given a large, but not clearly long-term infeasible, amount of computer power), the GOLEM meta-architecture is likely to be both steadfast and massively, self-improvingly intelligent. Full formalization of the argument is left for later, and may be a difficult task even if the argument is correct.

An alternate version of GOLEM is also described, which possesses more flexibility to adapt to an unknown future, but lacks a firm guarantee of steadfastness.

Discussion of the highly nontrivial problem of "how to encapsulate the goal of beneficialness in an AGI's goal system" is also left for elsewhere (see [?] for some informal discussion). As reviewed already in Chapter ??, we suspect this will substantially be a matter of interaction and education rather than mainly a matter of explicitly formulating ethical content and telling or feeding it to an AGI system.

D.2 The Goal Oriented Learning Meta-Architecture

The Goal Oriented LEarning Meta-architecture (GOLEM) refers to an AGI system S with the following high-level meta-architecture, depicted roughly in Figure D.1:

Fig. D.1: The GOLEM meta-architecture. Single-pointed arrows indicate information flow; double-pointed arrows indicate more complex interrelationships.

• Goal Evaluator = component that calculates, for each possible future world (including environment states and internal program states), how well this world fulfills the goal (i.e. it calculates the "utility" of the possible world)


– it may be that the knowledge supplied to the GoalEvaluator initially (the "base GEOP", i.e. "base GoalEvaluator Operating Program") is not sufficient to determine the goal-satisfaction provided by a world-state; in that case the GoalEvaluator may produce a probability distribution over possible goal-satisfaction values

– initially the GoalEvaluator may be supplied with an inefficient algorithm encapsulating the intended goals, which may then be optimized and approximated by application of the Searcher (thus leading to a GEOP different from the base GEOP)

– if the GoalEvaluator uses a GEOP produced by the Searcher, then there may be an additional source of uncertainty involved, which may be modeled by having the GoalEvaluator output a second-order probability distribution (a distribution over distributions over utility values), or else by collapsing this into a first-order distribution

• HistoricalRepository = database storing the past history of S's internal states and actions, as well as information about the environment during S's past

• Operating Program = the program that S is governing its actions by, at a given point in time

– chosen by the Metaprogram as the best program the Searcher has found, where "best" is judged as "highest probability of goal achievement" based on the output of the Predictor and the Goal Evaluator

• Predictor = program that estimates, given a candidate operating program P and a possible future world W, the odds of P leading to W

• Searcher = program that searches through program space to find a new program optimizing a provided objective function

• Memory Manager = program that decides when to store new observations and actions in the Historical Repository, and which ones to delete in order to do so; potentially it may be given some hard-wired constraints to follow, such as "never forget human history, or the previous century of your life."

• Tester = hard-wired program that estimates the quality of a candidate Predictor, using a simple backtesting methodology

– That is, the Tester assesses how well a Predictor would have performed in the past, using the data in the HistoricalRepository

• Metaprogram = fixed program that uses the Searcher program to find a good:

– Searcher program (judged by the quality of the programs it finds, as judged by the Predictor program)

– Predictor program (as judged by the Tester's assessments of its predictions)
– Operating Program (judged by the Predictor working with the Goal Evaluator, according to the idea of choosing an Operating Program with the maximum expected goal achievement)
– GoalEvaluator Operating Program (judged by the Tester, evaluating whether a candidate program effectively predicts goal-satisfaction given program-executions, according to the HistoricalRepository)

– Memory Manager (as judged by the Searcher, which rates potential memory management strategies based on the Predictor's predictions of how well the system will fare under each one)

The Metaprogram's choice of Operating Program, Goal Evaluator Operating Program and Memory Manager may all be interdependent, as the viability of a candidate program for


each of these roles may depend on what program is playing each of the other roles. The metaprogram also determines the amount of resources to allocate to searching for a Searcher versus a Predictor versus an OP, according to a fixed algorithm for parameter adaptation.
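To make the component roles concrete, here is a minimal Python sketch of the interfaces just listed, together with one cycle of the Metaprogram's Operating Program selection. The names, the candidate-world enumeration and the expected-utility calculation are illustrative assumptions of ours, not a specification:

    from typing import Callable, List

    History = List[object]   # stand-in for the HistoricalRepository contents
    Program = Callable       # stand-in for an executable program

    class GoalEvaluator:
        """Maps a hypothetical future history to a utility in [0, 1]."""
        def __init__(self, base_geop: Program):
            self.geop = base_geop
        def utility(self, world: History) -> float:
            return self.geop(world)

    class Predictor:
        """Estimates the probability of world W arising, given an operating
        program P and the history stored so far."""
        def probability(self, op: Program, world: History, hist: History) -> float:
            raise NotImplementedError

    class Searcher:
        """Searches program space for a program maximizing a fitness function."""
        def search(self, fitness: Callable[[Program], float]) -> Program:
            raise NotImplementedError

    def choose_operating_program(searcher: Searcher, predictor: Predictor,
                                 ge: GoalEvaluator, hist: History,
                                 candidate_worlds: List[History]) -> Program:
        """One step of the fixed Metaprogram: select the Operating Program whose
        predicted outcomes maximize expected utility under the GoalEvaluator."""
        def expected_utility(op: Program) -> float:
            return sum(predictor.probability(op, w, hist) * ge.utility(w)
                       for w in candidate_worlds)
        return searcher.search(expected_utility)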

While this is a very abstract "meta-architecture", it's worth noting that it could be implemented using CogPrime or any other practical AGI architecture as a foundation – in this case, CogPrime is "merely" the initial condition for the OP, the Memory Manager, the Predictor and the Searcher. However, demonstrating that self-improvement can proceed at a useful rate in any particular case like this, may be challenging.

Note that there are several fixed aspects in the above: the MetaProgram, the Tester, the GoalEvaluator, and the structure of the HistoricalRepository. The standard GOLEM, with these aspects fixed, will also be called the fixed GOLEM, in contrast to an adaptive GOLEM in which everything is allowed to be adapted based on experience.

D.2.1 Optimizing the GoalEvaluator

Note that the GoalEvaluator may need to be very smart indeed to do its job. However, an important idea of the architecture is that the optimization of the GoalEvaluator's functionality may be carried out as part of the system's overall learning¹.

In its initial and simplest form, the GoalEvaluator's internal Operating Program (GEOP) could basically be a giant simulation engine, that tells you, based on a codified definition of the goal function: in world-state W, the probability distribution of goal-satisfaction values is as follows. It could also operate in various other ways, e.g. by requesting human input when it gets confused in evaluating the desirability of a certain hypothetical world-state; by doing similarity matching according to a certain codified distance measure against a set of desirable world-states; etc.

However, the Metaprogram may supplement the initial "base GEOP" with an intelligent GEOP, which is learned by the Searcher, after the Searcher is given the goal of finding a program that will

• accurately agree with the base GEOP across the situations in the HistoricalRepository, as determined by the Tester

• be as compact as possible

In this approach, there is a "base goal evaluator" that may use simplistic methods, but then the system learns programs that do approximately the same thing as this but perhaps faster and more compactly, and potentially embodying more abstraction. Since this program learning has the specific goal of learning efficient approximations to what the GoalEvaluator does, it's not susceptible to "cheating" in which the system revises its goals to make them easier to achieve (unless the whole architecture gets broken).

What is particularly interesting about this mechanism is: it provides a built-in mechanism for extrapolation beyond the situations for which the base GEOP was created. The Tester requires that the learned GEOPs must agree with the base GEOP on the HistoricalRepository, but for cases not considered in the HistoricalRepository, the Metaprogram is then doing Occam's Razor

¹ This general idea was introduced by Abram Demski upon reading an earlier draft of this appendix, though he may not agree with the particular way I have improvised on his idea here.


based program learning, seeking a compact and hence rationally generalizable explanation of the base GEOP.
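A fitness function of the kind just described might be sketched as follows (an illustration of ours; the agreement measure and the complexity weight are arbitrary choices, and 'complexity' could be e.g. program length):

    def geop_fitness(candidate, base_geop, history_worlds, complexity):
        """Score a learned GEOP: agreement with the base GEOP across the
        historical situations, minus an Occam's Razor penalty on program size."""
        agreement = sum(1.0 - abs(candidate(w) - base_geop(w))
                        for w in history_worlds) / len(history_worlds)
        return agreement - 0.01 * complexity(candidate)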

D.2.2 Conservative Meta-Architecture Preservation

Next, the GOLEM meta-architecture assumes that the goal embodied by the GoalEvaluator includes, as a subgoal, the preservation of the overall meta-architecture described above (with a fallback to inaction if this seems infeasible). This may seem a nebulous assumption, but it's not hard to specify if one thinks about it the right way.

For instance, one can envision each of the items in the above component list as occupying a separate hardware component, with messaging protocols established for communicating between the components along cables. Each hardware component can be assumed to contain some control code, which is connected to the I/O system of the component and also to the rest of the component's memory and processors.

Then what we must assume is that the goal includes the following criteria, which we'll call conservative meta-architecture preservation:

1. No changes to the hardware or control code should be made except in accordance with the second criterion.
2. If changes to the hardware or control code are found, then the system should stop acting (which may be done in a variety of ways, ranging from turning off the power to self-destruction; we'll leave that unspecified for the time being as that's not central to the point we want to make here).

Any world-state that violates these criteria should be rated extremely low by the GoalEvaluator.
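As a toy illustration, the two criteria might be operationalized as follows (the component images, reference hashes and system methods here are hypothetical):

    import hashlib

    def architecture_intact(component_images, reference_hashes):
        """Criterion check: compare a hash of each fixed component's control
        code against a reference stored at system initialization."""
        return all(hashlib.sha256(image).hexdigest() == reference_hashes[name]
                   for name, image in component_images.items())

    def act_or_halt(system, component_images, reference_hashes):
        if architecture_intact(component_images, reference_hashes):
            system.step()          # criterion 1: act only under the intact architecture
        else:
            system.stop_acting()   # criterion 2: changes found, so stop acting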

D.2.3 Complexity and Convergence Rate

One might wonder why such a complex architecture is necessary. Why not just use, say, Schmidhuber's Godel Machine [?]? This is an architecture that, in theory, can take an arbitrary goal function, and figure out how to achieve it in a way that is provably optimal given its current knowledge and capabilities – including figuring out how to modify itself so that it can better achieve the goal in future, after the modifications take hold. If the specifics of the GOLEM architecture are a good idea, then a Godel Machine should eventually transform itself into a GOLEM.

The catch, however, lies in the word "eventually." Depending on the situation and the computational resources available, a Godel Machine might take quite a long time to form itself into a GOLEM or something similar. In the real world, while this time is passing, the Godel Machine itself could be accidentally doing bad things due to reasoning short-cuts it's forced to take in order to get actions produced within a reasonable time-frame given its limited resources. The finite but potentially large time-frame that a Godel Machine would take to converge to a GOLEM-like state might be a big deal in real-life terms; just as the large constant overhead

8/12/2019 Engineering General Intelligence Appendices B-H

http://slidepdf.com/reader/full/engineering-general-intelligence-appendices-b-h 58/104

52D GOLEM: Toward an AGI Meta-Architecture Enabling Both Goal Preservation and Radical Self-Improvement

involved in simulating a human brain on a 2012 Macbook plus a lot of hard drives is a big deal in practice, in spite of its being a triviality from a computing theory perspective.

This may seem like hair-splitting, because in order to work at all, GOLEM would also require a lot of computing resources. The hope with which GOLEM is presented is that it will be able to work with merely a humongous amount of computing resource, rather than, like the Godel Machine in its simple and direct form, an infeasible amount of computing resource. This has not been proved and currently remains a tantalizing conjecture.

D.3 The Argument For GOLEM’s Steadfastness

Our main goal here is to argue that a program with (fixed) GOLEM meta-architecture will be steadfast, in the sense that it will maintain its architecture (or else stop acting) while seeking to maximize the goal function implicit in its GoalEvaluator.

Why do we believe GOLEM can be steadfast? The basic argument, put simply, is that if:

• the GoalEvaluator and environment together have the property that:
  – world-states involving conservative meta-architecture preservation tend to have very high fitness
  – world-states not involving conservative meta-architecture preservation tend to have very low fitness
  – world-states approximately involving conservative meta-architecture preservation tend to have intermediate fitness
• the initial Operating Program has a high probability of leading to world-states involving conservative meta-architecture preservation (and this is recognized by the GoalEvaluator)

then the GOLEM meta-architecture will be preserved. Because: according to the nature of the metaprogram, it will only replace the initial Operating Program with another program that is predicted to be more effective at achieving the goal, which means that it will be unlikely to replace the current OP with one that doesn't involve conservative meta-architecture preservation.

Obviously, this approach doesn't allow full self-modification; it assumes certain key parts of the AGI (meta)architecture are hard-wired. But the hard-wired parts are quite basic and leave a lot of flexibility. So the argument covers a fairly broad and interesting class of goal functions.

D.4 A Partial Formalization of the Architecture and Steadfastness Argument

To partially formalize the above conceptual argument, we will assume the formal agents model introduced earlier.


D.4.1 Toward a Formalization of GOLEM

We will use the notation [A → B] to denote the space of functions mapping space A to space B. Also, in cases where we denote a function signature via Ψ_X, we will use X to denote the space of all programs embodying functions of that signature; e.g. GE is the space of all functions fulfilling the specification given for Ψ_GE.

The GOLEM architecture may be formally defined as follows.

• The Historical Repository H_t is a subset of the history x_{0-t}
• An Operating Program is a program embodying a function Ψ_OP : H → A. That is, based on a history (specifically, the one contained in the Historical Repository at a given point in time), it generates actions
• A Memory Manager is a program embodying a function Ψ_MM so that Ψ_MM(H_t, x_t) = H_{t+1}
• A Goal Evaluator is a program embodying a function Ψ_GE : H → [0, 1]. That is, it maps histories (hypothetical future histories, in the GOLEM architecture) into real numbers representing utilities
• A Goal Evaluator Operating Program is an element of class GE
• A Searcher is a program embodying a function Ψ_SR : [P → [0, 1]] → P. That is, it maps "fitness functions" on program space into programs
• A Predictor is a program embodying a function Ψ_PR : OP × GE × H → [0, 1]
• A Tester is a program embodying a function Ψ_TR : PR × H → [0, 1], where the output in [0, 1] is to be interpreted as the quality of the prediction
• A Metaprogram is a program embodying a function Ψ_MP : SR × H × PR × TR × GE² × MM → SR × PR × OP × GE × MM. The GE in the output, and one of the GEs in the input, are GEOPs.

The operation of the Metaprogram is as outlined earlier; and the effectiveness of the architecture may be assessed as its average level of goal achievement as evaluated by the GE, according to some appropriate averaging measure.
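These signatures translate directly into code; the following Python type aliases (an illustrative rendering of ours, with Callable standing in for the program spaces) restate them:

    from typing import Callable, List, Tuple

    Action = object
    History = List[object]
    Utility = float                                    # values in [0, 1]

    OP = Callable[[History], Action]                   # Psi_OP : H -> A
    MM = Callable[[History, object], History]          # Psi_MM : (H_t, x_t) -> H_{t+1}
    GE = Callable[[History], Utility]                  # Psi_GE : H -> [0, 1]
    SR = Callable[[Callable[[Callable], Utility]], Callable]  # Psi_SR : [P -> [0,1]] -> P
    PR = Callable[[OP, GE, History], Utility]          # Psi_PR : OP x GE x H -> [0, 1]
    TR = Callable[[PR, History], Utility]              # Psi_TR : PR x H -> [0, 1]
    MP = Callable[[SR, History, PR, TR, GE, GE, MM],
                  Tuple[SR, PR, OP, GE, MM]]           # Psi_MP; the repeated GEs are GEOPs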

As discussed above, a fixed GOLEM assumes a fixed GoalEvaluator, Tester and Metaprogram, and a fixed structure for the Historical Repository, and lets everything else adapt. One may also define an adaptive GOLEM variant in which everything is allowed to adapt, and this will be discussed below, but the conceptual steadfastness argument made above applies only to the fixed-architecture variant, and the formal proof below is similarly restricted.

Given the above formulation, it may be possible to prove a variety of theorems about GOLEM's steadfastness under various assumptions. We will not pursue this direction very far here, but will only make a few semi-formal conjectures, proposing some semi-formal propositions that we believe may result in theorems after more work.

D.4.2 Some Conjectures About GOLEM

The most straightforward cases in which to formally explore the GOLEM architecture are not particularly realistic ones. However, it may be worthwhile to begin with less realistic cases that are more analytically tractable, and then proceed with the more complicated and more realistic cases.


Conjecture D.1. Suppose that

• The Predictor is optimal (for instance an AIXI-type system)
• Memory management is not an issue: there is enough memory for the system to store all its experiences with reasonable access time
• The GE is sufficiently efficient that no approximative GEOP is needed
• The HR contains all relevant information about the world, so that at any given time, the Predictor's best choices based on the HR are the same as the best choices it would make with complete visibility into the past of the universe

Then, there is some time T so that from T onwards, GOLEM will not get any worse at achieving the goals specified in the GE, unless it shuts itself off.

The basic idea of Conjecture D.1 is that, under the assumptions, GOLEM will replace its various components only if the Predictor predicts this is a good idea, and the Predictor is assumed optimal (and the GE is assumed accurate, and the Historical Repository is assumed to contain as much information as needed). The reason one needs to introduce a time T > 0 is that the initial programs might be clever or lucky for reasons that aren't obvious from the HR.

If one wants to ensure that T = 0 one needs some additional conditions:

Conjecture D.2. In addition to the assumptions of Conjecture D.1, assume GOLEM's initial choices of internal programs are optimal based on the state of the world at that time. Then, GOLEM will never get any worse at achieving the goals specified in the GE, unless it shuts itself off.

Basically, what this says is: if GOLEM starts off with an ideal initial state, and it knows virtually everything about the universe that's relevant to its goals, and the Predictor is ideal – then it won't get any worse as new information comes in; it will stay ideal. This would be nice to know, as it would be verification of the sensibleness of the architecture; but it isn't much practical use, as these conditions are extremely far from being achievable.

Furthermore, it seems likely that

Conjecture D.3. Suppose that

• The Predictor is nearly optimal (for instance an AIXItl-type system)
• Memory management is not a huge issue: there is enough memory for the system to store a reasonable proportion of its experiences with reasonable access time
• The approximative GEOP in place is very close to accurate
• The HR contains a large percentage of the relevant information about the world, so that at any given time, the Predictor's best choices based on the HR are roughly the same as the best choices it would make with complete visibility into the past of the universe

Then, there is some time T so that from T onwards, GOLEM is very unlikely to get significantly worse at achieving the goals specified in the GE, unless it shuts itself off.

Basically, this says that if the assumptions of Conjecture D.1 are weakened to approximations, then the conclusion also holds in an approximate form. This also would not be a practically useful result, as the assumptions are still too strong to be realistic.

What might we be able to say under more realistic assumptions? There may be results such as


• allowing the Rewarder to see the OP, and packing the Predictor and GoalEvaluator into the Rewarder. In this case the Rewarder is tasked with giving the system a reward based on the satisfactoriness of the predicted outcome of running its Operating Program.

• allowing the Searcher to query the Rewarder with hypothetical actions in hypothetical scenarios (thus allowing the Rewarder to be used like the GoalEvaluator!)

This RL++ approach is basically the GOLEM in RL clothing. It requires a very smart Rewarder, since the Rewarder must carry out the job of predicting the probability of a given OP giving rise to a given world-state. The GOLEM puts all the intelligence in one place, which seems simpler. In RL++, one faces the problem of how to find a good Predictor, which may be solved by putting another Searcher and Metaprogram inside the Rewarder; but that complicates things inelegantly.

Note that the Predictor and GoalEvaluator are useful in RL++ specifically because we are assuming that in RL++ the Rewarder can see the OP. If the Rewarder can see the OP, it can reward the system for what it's going to do in the future if it keeps running the same OP, under various possible assumptions about the environment. In a strict RL design, the Rewarder cannot see the OP, and hence it can only reward the system for what it's going to do based on chancier guesswork. This guesswork might include guessing the OP from the system's actions – but note that, if the Rewarder has to learn a good model of what program the system is running via observing the system's actions, it's going to need to observe a lot of actions to get what it could get automatically by just seeing the OP. So the learning of the system can be much, much faster in many cases, if the Rewarder gets to see the OP and make use of that knowledge. The Predictor and GoalEvaluator are a way of making use of this knowledge.

Also, note that in GOLEM the Searcher can use the GoalEvaluator to explore hypothetical scenarios. In a strict RL architecture this is not possible directly; it's possible only via the system in effect building an internal model of the Rewarder, and using it to explore hypothetical scenarios. The risk here is that the system builds a poor model of the Rewarder, and thus learns less efficiently.

In all, it seems that RL is not the most convenient framework for thinking about architecture-preserving AGI systems, and looking at "goal-oriented architectures" like GOLEM makes things significantly easier.

D.6 Specifying the Letter and Spirit of Goal Systems (Are Both Difficult Tasks)

Probably the largest practical issue arising with the GOLEM meta-architecture is that, given the nature of the real world, it's hard to estimate how well the Goal Evaluator will do its job! If one is willing to assume GOLEM, and if a proof corresponding to the informal argument given above can be found, then the "predictably beneficial" part of the problem of "creating predictably beneficial AGI" is largely pushed into the problem of the GoalEvaluator.

This makes one suspect that the hardest problem of making predictably beneficial AGI probably isn't "preservation of formally-defined goal content under self-modification." This may be hard if one enables total self-modification, but it seems it may not be that hard if one places some fairly limited restrictions on self-modification, as is done in GOLEM, and begins with an appropriate initial condition.


The really hard problem, it would seem, is how to create a GoalEvaluator that implements the desired goal content – and that updates this goal content as new information about the world is obtained, and as the world changes – in a way that preserves the spirit of the original goals even if the details of the original goals need to change as the world is explored and better understood. Because the "spirit" of goal content is a very subtle and subjective thing.

The intelligent updating of the GEOP, including in the GOLEM design, will not update the original goals, but it will creatively and cleverly apply them to new situations as they arise – but it will do this according to Occam's Razor based on its own biases rather than necessarily according to human intuition, except insofar as human intuition is encoded in the base GEOP or the initial Searcher. So it seems sensible to expect that, as unforeseen situations are encountered, a GOLEM system will act according to learned GEOPs that are rationally considered "in the spirit of the base GEOP", but that may interpret that "spirit" in a different way than most humans would. These are subtle issues, and important ones; but in a sense they're "good problems to have", compared to problems like evil, indifferent or wireheaded² AGI systems.

D.7 A More Radically Self-Modifying GOLEM

It's also possible to modify the GOLEM design so as to enable it to modify the GEOP more radically – still with the intention of sticking to the spirit of the base GEOP, but allowing it to modify the "letter" of the base GEOP so as to preserve the "spirit." In effect this modification allows GOLEM to decide that it understands the essence of the base GEOP better than those who created the particulars of the base GEOP. This is certainly a riskier approach, but it seems worth exploring at least conceptually.

The basic idea here is that, where the base GEOP is uncertain about the utility of a world-state, the "inferred GEOP" created by the Searcher is allowed to be more definite. If the base GEOP comes up with a probability distribution P in response to a world-state W, then the inferred GEOP is allowed to come up with Q so long as Q is sensibly considered a refinement of P.

To see how one might formalize this, imagine P is based on an observation-set O1 containing N observations. Given another distribution Q over utility values, one may then ask: what is the smallest number K so that one can form an observation set O2 containing O1 plus K more observations, so that Q emerges from O2? For instance, if P is based on 100 observations, are there 10 more observations one could make so that from the total set of 110 observations, Q would be the consequence? Or would one need 200 more observations to get Q out of O2?

Given an error ε > 0, let the minimum number K of extra observations needed to create an O2 yielding Q within error ε be denoted obs_ε(P, Q). If we assume that the inferred GEOP outputs a confidence measure along with each of its output probabilities, we can then explore the relationship between these confidence values and the obs_ε values.

Intuitively, if the inferred GEOP is very confident, this means it has a lot of evidence about Q, which means we can maybe accept a somewhat large obs_ε(P, Q). On the other hand, if the inferred GEOP is not very confident, then it doesn't have much evidence supporting Q, so we can't accept a very large obs_ε(P, Q).
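To see what obs_ε looks like operationally, here is a brute-force Python sketch under simplifying assumptions of ours: a finite outcome space, observation sets represented as per-outcome counts, and L1 distance as the error measure (the text leaves the metric unspecified):

    from itertools import count

    def min_l1_after_k(counts, q, k):
        """Best achievable L1 distance to q after adding k observations to
        'counts' (observations can be added, never removed)."""
        total = sum(counts) + k
        excess = sum(max(0.0, c - qi * total) for c, qi in zip(counts, q))
        return 2.0 * excess / total

    def obs_eps(counts, q, eps, k_max=10**6):
        """Minimal K so that some extension of the observation set underlying P
        (given as counts) has empirical distribution within eps of Q. Integer
        rounding of the added counts is ignored in this sketch."""
        for k in count(0):
            if k > k_max:
                raise ValueError("no suitable K below k_max")
            if min_l1_after_k(counts, q, k) <= eps:
                return k

    # P from 100 coin flips (70 heads, 30 tails); Q = (0.5, 0.5); eps = 0.1
    print(obs_eps([70, 30], [0.5, 0.5], 0.1))   # 28 extra observations suffice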

² A term used to refer to situations where a system rewires its reward or goal-satisfaction mechanisms to directly enable its own maximal satisfaction.


Appendix E
Lojban++: A Novel Linguistic Mechanism for Teaching AGI Systems

E.1 Introduction

Human "natural language" is unnatural to an AGI program like CogPrime. Yet, understanding of human language is obviously critical to any AGI system that wants to interact flexibly in the human world, and/or that wants to ingest the vast corpus of knowledge that humans have created and recorded. With this in mind, it is natural to explore humanly-unnatural ways of granting AGI systems knowledge of human language; and we have done much of this in the previous appendices, discussing the use of linguistic resources that are clearly different in nature from the human brain's in-built linguistic biases. In this appendix we consider yet another humanly-unnatural means of providing AGI systems with linguistic knowledge: the use of the constructed language Lojban (or, more specifically, its variant Lojban++), which occupies an interesting middle ground between formal languages like logic and programming languages, and human natural languages. We will argue that communicating with AGI systems in Lojban++ may provide a way of

• providing AGI systems with experientially-relevant commonsense knowledge, much more easily than via explicitly encoding this knowledge in logic

• teaching AGI systems natural language much more quickly than would otherwise be possible, via communicating with AGIs in parallel using natural language and Lojban++

To put it more precisely: the essential goal of Lojban++ is to constitute a language for efficient, minimally ambiguous, and user-friendly communications between humans and suitably-constructed AI software agents such as CogPrime's. Another way to think about the Lojban++ approach is that it allows an AGI learning/teaching process that dissociates, to a certain extent, "learning to communicate with humans" from "learning to deal with the peculiarities of human languages." Similar to Lojban on which it is based, Lojban++ may also be used for communication between humans, but this interesting possibility will not be our focus here.

Some details on the particulars of the Lojban++ language proposal, aimed at readers familiar with Lojban, are given at the end of this appendix. In the initial portions of the appendix we describe Lojban++ and related ideas at a more abstract level, in a manner comprehensible to readers without prior Lojban background.


E.2 Lojban versus Lojban++

Lojban is itself an outgrowth of another constructed language, Loglan, created by Dr. James Cooke Brown around 1955 and first widely announced in a 1960 Scientific American article [?]. Loglan is still under development but now is not used nearly as widely as Lojban. First separated from Loglan in 1987, Lojban is a constructed language that lives at the border between natural language and computing language. It is a "natural-like language" in that it is speakable and writeable by humans and may be used by humans to discuss the same range of topics as natural languages. Lojban has a precise, specified formal syntax that can be parsed in the same manner as a programming language, and it has a semantics, based on predicate logic, in which ambiguity is carefully controlled. Lojban semantics is not completely unambiguous, but it is far less ambiguous than that of any natural language, and the careful speaker can reduce ambiguity of communication almost to zero with far less effort than in any natural language. On the other hand, Lojban also permits the speaker to utilize greater ambiguity when this is desirable in order to allow compactness of communication.

Many individuals attempting to learn and use Lojban have found, however, that it has two limitations. The Lojban vocabulary is unfamiliar and difficult to learn – though no more so than that of any other language belonging to a language family unfamiliar to the language learner. And, more seriously, the body of existing Lojban vocabulary is limited compared to that of natural languages, making Lojban communication sometimes slow and difficult. When using Lojban, one must sometimes pause to concoct new words (according to the Lojban principles of word construction), which can be fun, but is much like needing to stop over and over to build new tools in the context of using one's toolkit to build something; and is clearly not optimal from the perspective of teaching AGI systems.

To address these issues, Lojban++ constitutes a combination of Lojban syntax and Lojban vocabulary, extended with English vocabulary. So in a very rough sense, it may perhaps be understood as a pidgin of Lojban and English. Lojban++ is less elegant than Lojban but significantly easier to learn, and much easier to use in domains to which Lojban vocabulary has not yet been extended. In short, the goal of Lojban++ is to combine the mathematical precision and pragmatic ontology that characterize Lojban with the usability of a natural language like English, with its extensive vocabulary.

An extensive formal treatment of Lojban grammar has been published [?], and while there is no published hard-copy Lojban dictionary, there is a website, jbovlaste.lojban.org/, that serves this purpose and which is frequently updated as new coinages are created and approved by the Logical Language Group, a standing body charged with the maintenance of the language.

Although Lojban has not been adopted nearly as widely as Esperanto (an invented language with several hundred thousand speakers), the fact that there is a community of several hundred speakers, including several dozen who are highly fluent at least in written Lojban, is important. The decades of communicative practice that have occurred within the Lojban community have been invaluable for refining the language. This kind of practice buys a level of maturity that cannot be obtained in a shorter period of time via formal analysis or creative invention. For example, the current Lojban treatment of quantifiers is arguably vastly superior to that of any natural language [?], but that was not true in 1987, when it excelled more in mathematical precision than in practical usability. The current approach evolved through a series of principled revisions suggested from experience with practical conversation in Lojban. Any new natural-like language that was created for human-CogPrime or CogPrime-CogPrime communication would


need to go through a similar process of iterative renement through practical use to achieve asimilar level of usability.

E.3 Some Simple Examples

Now we give some examples of Lojban++. While these may be somewhat opaque to the reader without Lojban experience, we present them anyway just to give a flavor of what Lojban++ looks like; it would seem wrong to leave the discussion purely abstract.

Consider the English sentence,

When are you going to the mountain?

When written in Lojban, it looks like:

do cu’e klama le cmana

In Lojban++, with the judicious importation of English vocabulary, it takes a form more recognizable to an English speaker:

you cu’e go le cmana

A fairly standard predicate logic rendition of this, derived by simple, deterministic rules from the Lojban++ version, would be

atTime(go(you, mountain), ?X)
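To illustrate what such “simple, deterministic rules” might look like, here is a minimal Python sketch; the parse representation and function name are our own illustrative assumptions, not part of any published Lojban++ toolchain:

# Hypothetical sketch: mapping a parsed Lojban++ clause to a predicate
# logic string. The parse format (selbri, arguments, tense-question flag)
# is an illustrative assumption.
def to_predicate_logic(parse):
    selbri, args, tense_question = parse
    core = "%s(%s)" % (selbri, ", ".join(args))
    if tense_question:  # "cu'e" asks for an unspecified time
        return "atTime(%s, ?X)" % core
    return core

# "you cu'e go le cmana" parses to ("go", ["you", "mountain"], True)
print(to_predicate_logic(("go", ["you", "mountain"], True)))
# prints: atTime(go(you, mountain), ?X)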

Next, consider the more complex English sentence,

When are you going to the small obsidian mountain?

In Lojban, there is no word for obsidian, so one needs to be invented (perhaps by compounding the Lojban words for “glass” and “rock,” for example), or else a specific linguistic mechanism for quoting non-Lojban words needs to be invoked. But in Lojban++ one could simply say,

you cu’e go le small obsidian mountain

The construct “small obsidian mountain” is what is called a Lojban tanru, meaning a compound of words without a precisely defined semantics (though there are recognized constraints on tanru semantics based on the semantics of the components [?]). Alternatively, using the Lojban word marji, which incorporates explicit place structure (x1 = material/stuff/matter of composition x2), a much less ambiguous translation is achieved:

you cu’e go le small mountain poi marji loi obsidian

in which “poi marji loi obsidian” means “that is composed of [a mass of] obsidian.” This illustrates the flexible ambiguity achievable in Lojban. One can use the language in a way that minimizes ambiguity, or one can selectively introduce ambiguity in the manner of natural languages, when desirable.

The differences between Lojban and Lojban++ are subtler than they might appear at first. It is key to understand that Lojban++ is not simply a version of Lojban with English character-sequences substituted for Lojban character-sequences. A critical difference lies in the rigid, pre-determined argument structures associated with Lojban words. For instance, the Lojban phrase


klama fi la .atlantas. fe la bastn. fu le karce

corresponds to the English phrase

that which goes from Atlanta to Boston by car

To say this in Lojban++ without using “klama” would require

go fi’o source Atlanta fi’o destination Boston fi’o vehicle car

which is much more awkward. On the other hand, one could also avoid the awkward Lojban treatment of English proper nouns and say

klama fi la Atlanta fe la Boston fu le car

or

klama fi la Atlanta fe la Boston fu le karce

It’s somewhat a matter of taste, but according to ours, the latter most optimally balances simplicity with familiarity. The point is that the Lojban word “klama” comes with the convention that its second argument (indexed by “fe”) refers to the destination of the going, its third argument (indexed by “fi”) refers to the source of the going, and its fifth argument (indexed by “fu”) refers to the method of conveyance. No such standard argument-structure template exists in English for “go”, and hence using “go” in place of “klama” requires the use of the “fi’o” construct to indicate the slot into which each of the arguments of “go” is supposed to fall.
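The place-tag convention is compact enough to state as a lookup table. The following sketch (our own illustration; the resolver function is hypothetical) shows how the FA tags select klama’s slots:

# Lojban FA place tags, applied to klama's place structure:
# klama: x1 goes to destination x2 from origin x3 via route x4 by means x5
FA_TAGS = {"fa": 1, "fe": 2, "fi": 3, "fo": 4, "fu": 5}
KLAMA_PLACES = {1: "goer", 2: "destination", 3: "origin", 4: "route", 5: "means"}

def resolve_places(tagged_sumti):
    # tagged_sumti: list of (tag, argument) pairs
    return {KLAMA_PLACES[FA_TAGS[tag]]: arg for tag, arg in tagged_sumti}

# "klama fi la Atlanta fe la Boston fu le karce"
print(resolve_places([("fi", "Atlanta"), ("fe", "Boston"), ("fu", "le karce")]))
# prints: {'origin': 'Atlanta', 'destination': 'Boston', 'means': 'le karce'}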

The following table gives additional examples, both in Lojban and Lojban++.

English:  I eat the salad with croutons
Lojban:   mi citka le salta poi mixre lo sudnabybli
Lojban++: mi eat le salad poi mixre lo crouton
          mi eat le salad poi contain lo crouton

English:  I eat the salad with a fork
Lojban:   mi citka le salta sepi’o lo forca
Lojban++: mi eat le salad sepi’o lo fork

English:  I will drive along the road with the big trees
Lojban:   mi litru le dargu poi lamji lo barda tricu
Lojban++: mi ba travel fi’o vehicle lo car fi’o route le road poi adjacent lo so’i big tree
          mi ba litru fi lo car fe le road poi adjacent lo so’i big tree
          mi ba drive fi’o route le road poi adjacent lo so’i big tree

English:  I will drive along the road with great care
Lojban:   mi litru le dargu ta’i lo nu mi mutce kurji
Lojban++: mi ba drive fi’o route le road ta’i lo nu mi much careful
          mi ba litru le road ta’i lo nu mi much careful

English:  I will drive along the road with my infrared sensors on
Lojban:   mi ba litru le dargu lo karce gi’e pilno le miktrebo’a terzga
Lojban++: mi litru le road lo car gi’e use le infrared sensor
          mi litru le road lo car gi’e pilno le infrared te zgana
          mi drive fi’o vehicle lo car fi’o route le road gi’e use le infrared sensor

English:  I will drive along the road with the other cars
Lojban:   mi litru le dargu fi’o kansa lo drata karce
Lojban++: mi ba drive fi’o route le road fi’o kansa lo so’i drata car
          mi ba drive fi’o route le road fi’o with lo so’i drata car
          mi ba litru le road fi’o kansa lo so’i drata car

E.4 The Need for Lojban Software

In order for Lojban++ to be useful for human-CogPrime communication, parsing and semantic mapping software needs to be produced for the language, building on existing Lojban software.

There is a fully functional Lojban parser based on a parsing expression grammar (Powell, no date specified), as well as an earlier parser based on a BNF grammar. (And, parenthetically, the observation that Lojban is more conveniently formulated in PEG (Parsing Expression Grammar) form is in itself a nontrivial theoretical insight.) The creation of a Lojban++ parser based on the existing Lojban parser is a necessary and relatively straightforward, though not trivial, task.

On the other hand, no software has yet been written for formal semantic interpretation (“semantic mapping”) of Lojban expressions - mainly because Lojban has primarily been developed as an experimental language for communication between humans rather than as a language for human-CogPrime communication. Such semantic mapping software is necessary to complete the loop between humans and AI reasoning programs, enabling powerful cognitive and pragmatic interplay between humans and CogPrime systems. For Lojban++ to be useful for human-CogPrime interaction, this software must be created and must go in both directions: from Lojban++ to predicate logic and back again. As Lojban++ is a superset of Lojban, creating such software for Lojban++ will automatically include the creation of such software for Lojban proper.

There promises to be some subtlety in this process, but not on the level that’s required to semantically map human language. What is required to connect a Lojban++ parser with the RelEx NLP framework as described in Chapter ?? is essentially a mapping between

• the Lojban cmavo (structure words) and the argument-structures of Lojban gismu (root words)

• FrameNet frame-elements, and a handful of other CogPrime relationships (e.g. for dealing with space, time and inheritance)

These mappings must be built by hand, which should be somewhat time-consuming, but on the order of man-weeks rather than man-years of effort.[1] Once this is done, Lojban++ can be entered into CogPrime essentially as English would be if the RelEx framework worked perfectly. The difficulties of human language processing will be bypassed, though this still - of course - leaves the difficulties of commonsense reasoning and contextual interpretation.

For example, the Lojban root word klama is defined as

x1 comes/goes to destination x2 from origin x3 via route x4 using means/vehicle x5.

This corresponds closely to the FrameNet frame Motion, which has elements

[1] Carrying out the following mapping took a few minutes, so carrying out similar mappings for 800 FrameNet frames should take no more than a couple of weeks of effort.


• Theme (corresponding to x1 in the above Lojban definition)

• Source (x3)

• Goal (x2)

• Path (x4)

• Carrier (x5)

The Motion FrameNet frame also has some elements that klama lacks, e.g. Distance and Direction, which could of course be specified in Lojban using explicit labeling.
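A sketch of what one hand-built entry in such a gismu-to-FrameNet table might look like (the data-structure format and function are our own illustrative assumptions):

# One hand-built entry mapping a gismu's place structure onto FrameNet
# frame elements, per the klama/Motion correspondence above.
GISMU_TO_FRAMENET = {
    "klama": {
        "frame": "Motion",
        "places": {"x1": "Theme", "x2": "Goal", "x3": "Source",
                   "x4": "Path", "x5": "Carrier"},
    },
    # ... similar entries for the other gismu / FrameNet frames
}

def frame_elements(gismu, sumti):
    # sumti: dict mapping place labels (x1..x5) to arguments
    entry = GISMU_TO_FRAMENET[gismu]
    return entry["frame"], {entry["places"][k]: v for k, v in sumti.items()}

print(frame_elements("klama", {"x1": "mi", "x2": "la Boston", "x3": "la Atlanta"}))
# prints: ('Motion', {'Theme': 'mi', 'Goal': 'la Boston', 'Source': 'la Atlanta'})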

E.5 Lojban and Inference

Both Lojban and Lojban++ can be straightforwardly translated into predicate logic format (though the translation is less trivial in the case of Lojban++, as a little bit of English-word disambiguation must be done). This means that as soon as Lojban++ semantic mapping software is constructed, it will almost immediately be possible for CogPrime systems to reason about knowledge communicated to them in Lojban. This aspect of Lojban has already been explored in a preliminary way by Speer and Havasi’s [?] JIMPE software application, which involves a semantic network guiding logical reasoning, Lojban parsing and Lojban language production. While JIMPE is a relatively simplistic prototype application, it is clear that more complex examples of Lojban-based artificial inference are also relatively straightforwardly achievable via a conceptually similar methodology.

An important point to consider in this regard is that Lojban/Lojban++ contains two distinct aspects:

1. an ontology of predicates useful for representing commonsense knowledge (represented by the Lojban cmavo along with the most common Lojban content words)

2. a strategy for linearizing nested predicates constructed using these cmavo into human-pronounceable and -readable strings of letters or phonemes.

The second aspect is of no particular value for inference, but the first aspect is. We suggest that the Lojban++ ontology provides a useful framework for knowledge representation that may be incorporated at a fundamental level into any AI system that centrally utilizes predicate logic or a similar representation. While overlapping substantially with FrameNet, it has a level of commonsensical completeness that FrameNet does not, because it has been refined via practice to be useful for real-world communication. Similarly, although it is smaller than Cyc, it is more judiciously crafted. Cyc contains a lot of knowledge not useful for everyday communication, yet has various lacunae regarding the description of everyday objects and events - because no community has ever seriously tried to use it for everyday communication.

E.5.1 Lojban versus Predicate Logic

In the context of Lojban++ and inference, it is interesting to compare Lojban++ formulations with corresponding predicate logic formulations. For example, consider the English sentence

Hey, I just saw a bunch of troops going into the woods. What do you want me to do?


translates into the Lojban

ju’i do’u mi pu zi viska lo nu so’i lo sonci cu nenkla le ricfoi .i do djica lo nu mi mo

or the Lojban++

Hey do’u mi pu zi see lo nu so’i lo soldier cu enter le forest .i do want lo nu mi mo

which literally transcribed into English would be something like

Hey! [vocative terminator] I [past] [short time] see an event of (many soldiers enter forest). You want event (me what?)

Omitting the “hey,” a simple and accurate predicate logic rendition of this sentence would be

past($X) ∧ short_time($X) ∧ ($X = see(me, $Y)) ∧ ($Y = event(enter($Z, forest))) ∧ soldier($Z) ∧ many($Z) ∧ want(you, event(?W(me)))

where ?W refers to a variable being posed as a question to be answered, and $X and so forth refer to internal variables. The Lojban and Lojban++ versions have the same semantics as the predicate logic version, but are much simpler to speak, hear and understand due to the lack of explicit variables.

E.6 Discussion

Hopefully the above exposition of Lojban++, though incomplete, was sufficient to convince you that teaching “infant-level” or “child-level” AGIs about the world using Lojban++ would be significantly easier than doing so using English or other natural languages. The question then is whether this difference makes any real difference. One could counter-argue that, if an AGI were smart enough to really learn to interpret Lojban++, then it would be smart enough to learn to interpret English as well, with only minor additional effort. In sympathy with this counter-argument is the fact that successfully mapping Lojban++ utterances into predicate logic expressions, and representing these predicate logic expressions in an AI’s knowledge base, does not in itself constitute any serious “understanding” of the Lojban++ utterances on the part of the AI system. However, this counter-argument ignores the “chicken and egg problem” of commonsense knowledge and language understanding. If an AGI understands natural language, then it can be taught human commonsense knowledge via direct linguistic instruction. On the other hand, it is also clear that a decent amount of commonsense knowledge is a prerequisite for adequate natural language understanding (for such tasks as parse selection, semantic disambiguation and reference resolution, for example). One response to this is to appeal to feedback, and argue that commonsense knowledge and linguistic understanding arise and grow together. We believe this is largely true, and yet that there may also be additional dynamics at play in the developing human mind that accelerate the process, such as inbuilt inductive biases regarding syntax. In an AGI context, one way to accelerate the process may be to use Lojban++ to teach the young AGI system commonsense knowledge, which then may help it to more easily penetrate the complexities and ambiguities of natural language. This assumes, of course, that the knowledge gained by the system from being instructed in Lojban++ is genuine


flagged as a proper name (by the cmavo “la”), is an English word intended to be interpreted as a brivla. It will not do any parsing of the word to try to interpret tense, number, adverbiality, etc. Next, English idiomatic collocations, if used in written Lojban++, should be written with an underscore between the component words. For example: New_York, run_wild, big_shot, etc.

Without the underscore, the Lojban++ parser will assume that it is seeing a tanru (so that e.g. “big shot” is a type of “shot” that is modified by “big”). In spoken Lojban++, the formally correct thing is to use the new cmavo “quay”, to be discussed below; but in practice, when using Lojban++ for human-human communication, this may often be omitted. Finally, a less formal guideline concerns the use of highly ambiguous English words, the use of obscure senses of English words, and the use of English words in metaphorical senses. All of these should be avoided. They won’t confuse the Lojban++ parsing process, but they will confuse the Lojban++ semantic mapping process. If a usage seems like it would confuse an AI program without much human cultural experience, then try to avoid it. Don’t say

you paint ti

to mean “paint” in the sense of “portray vividly”, when you could say

you cu vivid bo describe ti

The latter will tell an AI exactly what’s happening; the former may leave the AI wondering whether what’s being depicted is an instance of description, or an instance of painting with an actual paintbrush and oils. Similarly, to say

you kill me

when you mean

you much amuse me

is not in the Lojban++ spirit. Yes, an AI may be able to figure this out by reference to dictionaries combined with contextual knowledge and inference, but the point of Lojban++ is to make communication simple and transparent so as to reduce the possibility for communication error.

E.8 Syntax-based Argument Structure Conventions for English Words

Next, one of the subtler points of Lojban++ involves the automatic assignment of Lojban argument-structures to English words. This is done via the following rules:

1. Nouns are interpreted to have one argument, which is interpreted as a member of the category denoted by the noun

a. la Ben human

2. Adjectives/adverbs are taken to have two arguments: the first is the entity modified by the adjective/adverb, the second is the extent to which the modification holds

a. la Ben fat le slight

3. Intransitive verbs are interpreted to have at least one argument, which is interpreted as the argument of the predicate represented by the verb

a. le cockroach die

4. Transitive verbs are interpreted to have at least two arguments, the subject and then the object

a. la Ben kill le cockroach

5. Ditransitive verbs are interpreted to have three arguments, and conventions must be made for each of these cases, e.g.

a. give x y z may be interpreted as “x gives y to z”

i. la Ben give le death le cockroach

b. take x y z may be interpreted as “x takes y from z”

i. la Ben take le life le cockroach

A rule of thumb here is that the agent comes first, the recipient comes last, and the object comes in between.
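These conventions are simple enough to state as a lookup keyed on part of speech, as in the following sketch (the category names and the assumed external tagger are hypothetical simplifications of ours):

# Illustrative sketch of the syntax-based argument-structure conventions.
# Part-of-speech categories are assumed to come from an external tagger.
ARG_CONVENTIONS = {
    "noun":         ["member"],                        # x1 is a member of the category
    "adjective":    ["modified", "extent"],            # x1 modified entity, x2 extent
    "intransitive": ["subject"],
    "transitive":   ["agent", "object"],
    "ditransitive": ["agent", "object", "recipient"],  # agent first, recipient last
}

def argument_roles(pos):
    # Return the role labels for an English word used as a Lojban++ predicate.
    return list(ARG_CONVENTIONS[pos])

print(argument_roles("ditransitive"))  # ['agent', 'object', 'recipient']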

E.9 Semantics-based Argument Structure Conventions for EnglishWords

The above syntax-based argument-structure conventions are valuable, but not sufficiently thorough to allow for fluent Lojban++ usage. For this reason a collection of semantics-based argument-structure conventions has been created, based mostly on porting argument-structures from related Lojban words to English vocabulary. The following list is the current working version, and is likely to be extended a bit through actual usage.

1. Plant or animal (moss, cow, pig)

a. x1 is a W of species x2

2. Spatial relation (beneath, above, right, left)

a. x1 is in relation W to x2, in reference frame x3

3. Dimension-dependent spatial descriptor (narrow, deep, wide, etc.)

a. x1 is W in dimension x2, relative to standard x3

4. Unit (foot, hour, meter, mile)

a. x1 is x2 W’s by standard x3

5. Kinship or other interpersonal relationship (mother, father, uncle, boss)

a. x1 is the W of x2

6. Thought-action (remember, think, intuit, know)

a. x1 W’s x2
b. x1 W’s x2 about x3

7. Creative product (poem, painting, book)

a. x1 is a W about plot/theme/subject/pattern x2 by author x3 for intended audience x4

8. Physical action undertaken by one agent on another (touch, kick, kiss)

a. x1 (agent) W’s x2 with x3 [a locus on x1 or an instrument] at x4 [a locus on x2]

9. W denotes a type of substance, e.g. mush, paste, slime

a. x1 is a W composed of x2

10. Instance of communication (ask, tell, command)

a. x1 W’s x2 with information content x3

11. Type of utterance (comment, question)

a. x1 (text) is a W about subject x2 expressed by x3 to audience x4

12. Type of movement (walking, leaping, jumping, climbing)

a. x1 (agent/object) W’s to x2 from x3 in direction x4

13. Route, path, road, trail, etc.

a. x1 is a W to x2 from x3 via/defined by points including x4 (set)

14. Nationality, culture etc.

a. x1 reflects W in aspect x2

15. Type of event involving humans or other social agents (celebration, meeting, funeral)

a. x1 partakes, with purpose x2, in event x3 of type W

16. Posture or mode of physical activity of an embodied agent (stand, sit, lie, stoop)

a. x1 W’s on surface x2

17. Type of mental construct (idea, thought, dream, conjecture, etc.)

a. x1 is a W about x2 by mind x3

18. Type of event done by someone, potentially to someone else (accident, disaster, injury)

a. x1 is a W done by x2 to x3

19. Comparative amount (half, third, double, triple)

a. x1 is W of x2 in quality x3

20. Relation between an agent and a statement (assert, doubt, refute, etc.)

a. x1 W’s x2

21. Spatial relationship (far, near, close)

a. x1 is W from x2 in dimension x3

22. Human emotion (happy, sad, etc.)

a. x1 is W about x2

23. A physically distinct part of some physical object, including a body part

a. x1 is a W on x2

24. Type of physical transformation (e.g. mash, pulverize, etc.)

a. x1 [force] W’s x2 into mass x3

25. Way of transmitting an object (push, throw, toss, fling)

a. x1 W’s object x2 to/at/in direction x3

26. Relative size indicator (big, small, huge)

a. x1 is W relative to x2 by standard x3
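Like the syntactic rules, these conventions amount to a small hand-built table; the sketch below (the category keys and lookup function are our own illustration) shows the intended usage:

# Illustrative sketch of the semantics-based conventions: each semantic
# category fixes a place structure for the English words assigned to it.
SEMANTIC_CONVENTIONS = {
    "plant_or_animal":   "x1 is a W of species x2",
    "spatial_relation":  "x1 is in relation W to x2, in reference frame x3",
    "unit":              "x1 is x2 W's by standard x3",
    "kinship":           "x1 is the W of x2",
    "communication_act": "x1 W's x2 with information content x3",
    # ... one entry per numbered category above
}

WORD_CATEGORY = {"cow": "plant_or_animal", "mother": "kinship",
                 "ask": "communication_act"}

def place_structure(english_word):
    # Look up the Lojban-style place structure for an English word.
    return SEMANTIC_CONVENTIONS[WORD_CATEGORY[english_word]]

print(place_structure("ask"))  # x1 W's x2 with information content x3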

E.10 Lojban gismu of clear use within Lojban++

There are some Lojban gismu (content words) which are clearly much more useful within Lojban++ than their English counterparts. Mostly this is because their argument structures involve more than two arguments, but occasionally it is because they involve a two-argument structure that happens not to be well captured by any English word (but is usually represented in English by a more complex construct involving one or more prepositions).

A list of roughly 300 gismu currently judged to be “essential” in this sense is at http://www.goertzel.org/papers/gismu_essential.txt, and a list of fewer than 50 additional gismu judged potentially very useful, but not quite so essential, is at http://www.goertzel.org/papers/gismu_usefu

E.11 Special Lojban++ cmavo

Next, there are some special cmavo (structure words) that are useful in Lojban++ but not present in ordinary Lojban. A few more Lojban++ cmavo may be added as a result of practical experience communicating in Lojban++; but these are it, for now.


E.11.1 qui

Pronounced “kwee”, this is a cmavo used in Lojban++ to create words with unambiguous senses, as in the examples:

pig qui animal

pig qui cop

The second English word in the compound is a sense-specifier. Generally this should only be used where the intended word-sense is not the one that would be most obviously expected given the context.

In some rare cases one might want two modifiers, using the form

(English word) qui (English word) qui (English word)

E.11.2 it, quu

The basic idea is that there is one special referential word in Lojban++ - “it” - which goes along with a reference-target-indicator “quu” (pronounced “kwuhh”), which gives a qualitative indication of the referent of a given instance of “it”, intended to narrow down the scope of the reference resolution process.

For instance, you could say

la Dr. Benjamin Goertzel cu proceed le playground. It quu man cu kill le dog. It cu eat le cat.

In this case, “it” is defined to refer to “Dr. Benjamin Goertzel”, not to “man” generically. The “man” qualifier following the “quu” is intended merely to guide the listener’s mind toward the right antecedent for the pronoun. It is not intended to explicitly define the pronoun. So, basically,

it quu male

is the rough equivalent of the English “he”, and

it quu female

is the rough equivalent of the English “she”.

him/her/they

Finally, for the sake of usability, it is worthwhile within Lojban++ to introduce the following shorthands:

• him → it quu male

• her → it quu female

• ver → it quu person

• they → it quu people

(Note that “him” in Lojban++ thus plays the role of both “him” and “he” in English.)


E.11.3 quay

Pronounced “kway,” this cmavo separates parts of an English collocation in speech, e.g.

big quay shot

It may often be omitted in informal speech; and in writing it may be replaced by an underscore (big_shot).


Appendix F
PLN and the Brain

Co-authored with Cassio Pennachin

F.1 How Might Probabilistic Logic Networks Emerge from Neural Structures and Dynamics?

In this appendix, we digress briefly to explore how PLN constructs like inheritance and similarity relationships might emerge from brainlike structures such as cell assemblies and neural activation patterns. This is interesting as speculative neuroscience, and also potentially valuable in the context of hybrid architectures, in terms of tuning the interrelationship between CogPrime’s Atomspace and neural-net-like systems such as DeSTIN. If nothing else, the ideas of this section serve as a conceptual argument for why it makes sense to interface PLN representations and dynamics with CSDLN representations and dynamics. While conventionally formalized and discussed using different languages, these different approaches to knowledge and learning are actually not so far apart as is commonly believed.

We restrict ourselves here to FOPLN, which does not involve explicit variables or quantifiers, and may be described as the logic of uncertain inheritance relationships. Since, in PLN, higher-order logic reduces to first-order logic, this is actually all we need to deal with: a neural implementation of higher-order PLN follows from a neural representation of FOPLN plus a neural representation of higher-order functions such as the one suggested in Chapter ?? of Part 1.

As described above, the semantics of the term logic relationship “A inherits from B”, or A → B, is that when B is present, A is also present. The truth value of the relationship measures the percentage of the times that B is present in which A is also present. “A is similar to B”, or A ↔ B, is a symmetrical version, whose truth value measures the percentage of the times that either one is present in which both are present. These are the relations we will deal with here.

How can this be tied in with the brain? Suppose we have two assemblies A1 and A2, which are activated in the brain when the organism is presented with stimuli in categories C1 and C2 respectively (to take the simplest case of concepts, i.e. perceptual categories). Then, we may say that there is a neural inheritance A1 → A2, whose probabilistic strength is the number w so that

P(A1’s mean activation > T at time t) * w

best approximates


representation, via introducing the conceptual machinery of virtual synapses and neural inher-itance.

F.2 Avoiding Issues with Circular Inference

When one develops in more detail the ideas from the previous section, connecting uncertain term logic inference with neurodynamics, only one possible snag arises. Existing computational frameworks for uncertain term logic inference utilize special mechanisms for controlling circular inference, and these mechanisms have no plausible neurological analogues. In this section we explore this issue and argue that it’s not necessarily a big deal. In essence, our argument is that these biologically unnatural circularity-avoidance mechanisms are unnecessary in a probabilistic term logic system whose operations are guided by appropriate adaptive attention-allocation mechanisms. It’s only when operating probabilistic term logic inference in isolation, in a manner that’s unnatural for a resource-constrained intelligent system, that these circular-inference issues become severe.

We note, however, that this conclusion seems to be specific to probabilistic term logic, and doesn’t seem to hold for NARS term logic, in which the circular inference problem may be more severe, and may in fact require a trail mechanism more strongly. We have not investigated this issue carefully.

To understand the circular inference problem, look at the triangles in Figure ??. It’s easy to see that by performing deduction, induction and abduction in sequence, we can go around and around an inference triangle forever, combining the links in different orders, inferring each link in the triangle from the two others over and over again. What often happens when you do this in a computer program performing uncertain term logic inference, however, is that after long enough the inference errors compound, and the truth values descend into nonsense. The solution taken in the NARS and PLN uncertain term logic inference engines is something called inference trails. Basically, each inheritance link maintains a trail, which is a list of the nodes and links used as premises in inferences determining its truth value. And a rule is put in place that the link L should not be used to adjust the truth value of the link M if M is in L’s trail.

Trails work fine for computer programs implementing uncertain term logic, though managing them properly does involve various complexities. But, from the point of view of the brain, trails seem quite unacceptable. It would seem implausible to hypothesize that the brain somehow stores a trail along with each virtual synapse. The brain must have some other method of avoiding circular inferences leading to truth value noisification.

In order to explore these issues, we have run a number of experiments with trail-free probabilistic inference. The first of these involved doing inferences on millions of nodes and links (with nodes representing words and links derived via word co-occurrence probabilities across a text corpus). What we found was that, in practice, the severity of the circular-inference problem depended on the inference control strategy. When one implemented a strategy in which the amount of attention devoted to inference about a link L was proportional to an estimate of the amount of information recently gained by doing inference about L, then one did not run into particularly bad problems with circular inference. On the other hand, if one operated with a small number of nodes and links and repeatedly ran the same inferences over and over again on them, one did sometimes run into problems with truth value degeneration, in which the term logic formulas would cause link strengths to spuriously converge to 1 or 0.

To better understand the nature of these phenomena, we ran computer simulations of small Atomspaces involving nodes and Inheritance relations, according to the following idea:

1. Each node is assumed to denote a certain perceptual category
2. For simplicity, we assume an environment in which the probability distribution of co-occurrences between items in the different categories is stationary over the time period of the inference under study
3. We assume the collection of nodes and links has its probabilistic strengths updated periodically, according to some “inference” process
4. We assume that the results of the inference process in Step 3 and the results of incorporating new data from the environment (Step 2) are merged together ongoingly via a weighted-averaging belief-revision process

In our simulations, Step 3 was carried out via executions of PLN deduction and inversion inference rules. The results of these simulations were encouraging: most of the time, the strengths of the nodes and links settled, after a while, into a “fixed point” configuration not too distant from the actual probabilistic relationships implicit in the initial data. The final configuration was rarely equivalent to the initial configuration, but it was usually close.

For instance, one experiment involved 1000 random “inference triangles” involving 3 links, where the nodes were defined to correspond to random subsets of a fixed finite set (so that inheritance probabilities were defined simply in terms of set intersection). Given the specific definition of the random subsets, the mean strength of each of the three inheritance relationships across all the experiments was about .3. The Euclidean distance between the 3-vector of the final (fixed point) link strengths and the 3-vector of the initial link strengths was roughly .075. So the deviation from the true probabilities caused by iterated inference was not very large. Qualitatively similar results were obtained with larger networks.
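A minimal sketch of one such triangle experiment, assuming the standard PLN deduction strength formula and a simple weighted-average revision step; the set sizes, revision weight and update schedule are our own illustrative choices, not the parameters of the original experiments:

import random

random.seed(0)
N = 10000  # size of the underlying finite set

def random_subset(p):
    return frozenset(x for x in range(N) if random.random() < p)

A, B, C = (random_subset(0.3) for _ in range(3))
sB, sC = len(B) / N, len(C) / N

def strength(X, Y):
    # inheritance strength as a conditional probability over set members
    return len(X & Y) / len(X)

data = {"AB": strength(A, B), "BC": strength(B, C), "AC": strength(A, C)}
links = dict(data)  # current inferred strengths

def deduction(sXY, sYZ, sY, sZ):
    # standard PLN deduction strength formula (independence-based)
    return sXY * sYZ + (1 - sXY) * (sZ - sY * sYZ) / (1 - sY)

for _ in range(100):
    inferred = deduction(links["AB"], links["BC"], sB, sC)  # Step 3: inference
    links["AC"] = 0.5 * data["AC"] + 0.5 * inferred         # Step 4: revision

print("deviation from true probability:", abs(links["AC"] - data["AC"]))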

The key to these experiments is the revision in Step 4: it is assumed that, as iterated inference proceeds, information about the true probabilities is continually merged into the results of inference. If not for this, Step 3 on its own, repeatedly iterated, would lead to noise amplification and increasingly meaningless results. But in a realistic inference context, one would never simply repeat Step 3 on its own. Rather, one would carry out inference on a node or link only when there was new information about that node or link (directly leading to a strength update), or when some new information about other nodes/links indirectly led to inference about that node/link. With enough new information coming in, an inference system has no time to carry out repeated, useless cycles of inference on the same nodes/links - there are always more interesting things to assign resources to. And the ongoing mixing-in of new information about the true strengths with the results of iterated inference prevents the pathologies of circular inference, without the need for a trail mechanism.

What we see from these various experiments is that if one uses an inference control mechanism that avoids the repeated conduction of inference steps in the absence of infusion of new data, issues with circular inference are not severe, and trails are not necessary to achieve reasonable node and link strengths via iterated inference. Circular inference can occur without great harm, so long as one only does it when relevant new data is coming in, or when there is evidence that it is generating information. This is not to say that trail mechanisms are useless in computational systems - they provide an interesting and sometimes important additional layer of protection against circular inference pathologies. But in an inference system that is integrated with an appropriate control mechanism they are not required. The errors induced by circular inference, in practice, may be smaller than many other errors involved in realistic inference. For instance, in the mapping between the brain and uncertain term logic proposed above, we have relied upon a fairly imprecise proportionality between virtual synapse weight and neural inheritance.

We are not attempting to argue that the brain implements precise probabilistic inference, but only an imprecise analogue. Circular inference pathologies are probably not the greatest source of imprecision.

F.3 Neural Representation of Recursion and Abstraction

The material of the previous subsection comprises a speculative but conceptually coherent connection between brain structures and dynamics on the one hand, and probabilistic logic structures and dynamics on the other. However, everything we have discussed so far deals only with first-order term logic, i.e. the logic of inheritance relationships between terms.

Extension to handle similarity relationships, intensional inheritance and so forth is straightforward - but what about more complex term logic constructs, such as those that would conventionally be expressed using variables and quantifiers? In this section we seek to address this shortcoming, by proposing a hypothesis as to how probabilistic term logic in its full generality might be grounded in neural operations. This material is even more speculative than the above ideas, yet something of this nature is critically necessary for completing the conceptual picture.

The handling of quantifiers, in itself, is not the hard part. We have noted above that, in a term logic framework, if one can handle probabilistic variable-bearing expressions and functions, then one can handle quantifiers attached to the variables therein. So the essence of the problem is how to handle variables and functions. And we suggest that, when one investigates the issue in detail, a relatively simple hypothesis emerges clearly as essentially the only plausible explanation, if one adopts the neural assembly theory as a working foundational assumption.

In the existing body of mathematical logic and theoretical computer science, there are two main approaches to handling higher-order expressions: variable-based, or combinator-based [?, ?]. It seems highly implausible, to us, that the human brain is implementing some sort of intricate variable-management scheme on the neural-assembly level. Lambda calculus and other formal schemes for manipulating variables appear to us to require complexity and precision of a style that self-organizing neural networks are ill-suited to produce via their own style of complex dynamics. Of course it is possible to engineer neural nets that do lambda calculus (as neural nets are Turing-complete), but this sort of neural-net structure seems unlikely to emerge via evolution, and unlikely to get created via known epigenetic processes.

But what about combinators? Here, it seems to us, things are a bit more promising. Combinators are higher-order functions: functions that map functions into functions; functions that map {functions that map functions into functions} into {functions that map functions into functions}; and so forth. There are specific sets of combinators that are known to give rise to universal computational capability; indeed, there are many such sets, and one approach to the implementation of functional programming languages is to craft an appropriate set of combinators that combines universality with tractability (the latter meaning, basically, that the combinators have relatively simple definitions, and that pragmatically useful logic expressions tend to have compact representations in terms of the given set of combinators).


We lack the neurodynamic knowledge to say, at this point, that any particular set of combinators seems likely to map into brain function. However, we may still explore the fundamental neural functionalities that would be necessary to give rise to a combinatory-logic-style foundation for abstract neural computation. Essentially, what is needed is the capability to supply one neural assembly as an input to another. Note that what we are talking about here is quite different from the standard notion of chaining together neural assemblies, so that the output of assembly A becomes the input of assembly B. Rather, what we are talking about is that assembly A itself - as a mapping from inputs to outputs - is fed as an input to assembly B. In this case we may call B a higher-order neural assembly.

Of course, there are numerous possible mechanisms via which higher-order neural assemblies could be implemented in the brain. Here we will discuss just one. Consider a neural assembly A1 with certain input neurons, certain output neurons, and certain internal “hidden layer” neurons. Then, suppose there exists a “router” neural assembly X, which is at the receiving end of connections from many neurons in A1, including input, output and hidden neurons. Suppose X is similarly connected to many other neural assemblies A2, A3, ... and so forth; and suppose X contains a “control switch” input that tells it which of these assemblies to pay attention to (so, for instance, if the control input is set to 3, then X receives information about A3). When X is paying attention to a certain assembly, it routes the information it gets from that assembly to its outputs. (Going further, we may even posit a complex control switch that accepts more involved commands; say, a command that directs the router to a set of k of its input assemblies, and also points it to a small neural assembly implementing a combination function that tells it how to combine these k assemblies to produce a composite.)

Finally, suppose the input neurons of assembly B are connected to the router assembly X. Then, depending on how the router switch is set, B may be said to receive one of the assemblies Ak as input. And, next, suppose B’s output is directed to the control switch of the router. Then, in effect, B is mapping assemblies to assemblies, in the manner of a higher-order function. And of course, B itself is “just another neural assembly,” so that B itself may be routed by the router, allowing for assemblies that map {assemblies mapping assemblies} into assemblies, and so forth.

Where might this kind of “router” assembly exist in the brain? We don’t know, at the moment. Quite possibly, the brain may implement higher-order functions by some completely different mechanism. The point we want to make, however, is that there are concrete possibilities via which the brain could implement higher-order logic according to combinatory-logic-type mechanisms. Combinators might be neurally represented as neural assemblies interacting with a router assembly, as hypothesized above, and in this way the Hebbian logic mechanisms proposed in the previous sections could be manifested more abstractly, allowing the full scope of logical reasoning to occur among neural assemblies, with uncertainty management mediated by Hebbian-type synaptic modification.


Appendix G
Possible Worlds Semantics and Experiential Semantics

Co-authored with Matthew Iklé

G.1 Introduction

The relevance of logic to AGI is often questioned, on the grounds that logic manipulates abstract symbols, but once you’ve figured out how to translate concrete perception and action into abstract symbols in an appropriate way, you’ve already solved the hard part of the AGI problem. In this view, human intelligence does logic-like processing as a sort of epiphenomenon, on top of a deeper and more profound layer of subsymbolic processing; and logic is more suitable as a high-level description that roughly approximates the abstract nature of certain thought processes, than as a method of actually realizing these thought processes.

Our own view is that logic is a flexible tool which may be used in many different ways. For example, there is no particular reason not to use logic directly on sensory and actuator data, or on fairly low-level abstractions thereof. This hasn’t been the tradition in logic or logic-based AI, but this is a matter of culture and historical accident more than anything else. This would give rise to difficult scalability problems, but so does the application of recurrent neural nets or any other powerful learning approach. In CogPrime we propose to handle the lowest level of sensory and actuator data in a different way, using a CSDLN such as DeSTIN, but we actually believe a PLN approach could be used in place of a system like DeSTIN, without significant loss of efficiency, and with only a moderate increase in complexity. For example, one could build a CSDLN whose internal operations were all PLN-based - this would make the compositional spatiotemporal hierarchical structure, in effect, into an inference control mechanism.

In this appendix we will explore this region of conceptual space by digging deeper into the semantics of PLN, looking carefully and formally at the connection between PLN terms and relationships and the concrete experience of an AI system acting in a world. As well as providing a more rigorous foundation for some aspects of the PLN formalism, the underlying conceptual purpose is to more fully explicate the relationship between PLN and the world a CogPrime-controlled agent lives in.

Specifically, what we treat here is the relation between experiential semantics (on which PLN, and the formal model of intelligent agents presented in Chapter ?? of Part 1, are both founded) and possible-worlds semantics (which forms a more mathematically and conceptually natural foundation for certain aspects of logic, including certain aspects of PLN). In “experiential semantics”, the meaning of each logical statement in an agent’s memory is defined in terms of the agent’s experiences. In “possible worlds semantics”, the meaning of a statement is defined by reference to an ensemble of possible worlds including, but not restricted to, the one the agent interpreting the statement has experienced. In this appendix, for the first time, we formally specify the relation between these two semantic approaches, via providing an experiential grounding of possible worlds semantics. We show how this simplifies the interpretation of several aspects of PLN, providing a common foundation for setting various PLN system parameters that were previously viewed as distinct.

The reader with a logic background should note that we are construing the notion of possible worlds semantics broadly here, in the philosophical sense [?], rather than narrowly in the sense of Kripke semantics [?] and its relatives. In fact there are interesting mathematical connections between the present formulation and Kripke semantics and epistemic logic, but we will leave these for later work.

We begin with the indefinite probabilities recalled in Chapter ??, noting that the second-order distribution involved therein may be interpreted using possible worlds semantics. Then we turn to uncertain quantifiers, showing that the third-order distribution used to interpret these in [?] may be considered as a distribution over possible worlds. Finally, we consider intensional inference, suggesting that the complexity measure involved in the definition of PLN intension [?] may be derived from a probability measure over possible worlds. The moral of the story is that by considering the space of possible worlds implicit in an agent’s experience, one arrives at a simpler unified view of various aspects of the agent’s uncertain reasoning than if one grounds these aspects in the agent’s experience directly. This is not an abandonment of experiential semantics but rather an acknowledgement that a simple variety of possible worlds semantics is derivable from experiential semantics, and usefully deployable in the development of uncertain inference systems for general intelligence.

G.2 Inducing a Distribution over Predicates and Concepts

First we introduce a little preliminary formalism. Given a distribution over environments as defined in Chapter ?? of Part 1, and a collection of predicates evaluated on subsets of environments, we will find it useful to define distributions (induced by the distribution over environments) giving the probabilities of these predicates.

Suppose we have a pair (F, T) where F is a function mapping sequences of perceptions into fuzzy truth values, and T is an integer connoting a length of time. Then, we can define the prior probability of (F, T) as the average degree to which F is true, over a random interval of perceptions of length T drawn from a random environment drawn from the distribution over environments. More generally, if one has a pair (F, f), where f is a distribution over the integers, one can define the prior probability of (F, f) as the weighted average of the prior probability of (F, T), where T is drawn from f.
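A minimal Monte Carlo reading of this definition, with a hypothetical environment-sampling API standing in for the distribution over environments:

def prior_probability(F, T, sample_environment, n_samples=10000):
    # Estimate the prior probability of (F, T): the average fuzzy truth
    # value of F over random length-T perception intervals drawn from
    # random environments. sample_environment() and
    # env.sample_perceptions(T) are illustrative stand-ins, not a real API.
    total = 0.0
    for _ in range(n_samples):
        env = sample_environment()
        percepts = env.sample_perceptions(T)
        total += F(percepts)  # F returns a fuzzy truth value in [0, 1]
    return total / n_samples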

While expressed in terms of predicates, the above formulation can also be useful for dealing with concepts, e.g. by interpreting the concept cat in terms of the predicate isCat. So we can use this formulation in inferences where one needs a concept probability like P(cat) or a relationship probability like P(eat(cat, mouse)).


G.3 Grounding Possible Worlds Semantics in Experiential Semantics

Now we explain how to ground a form of possible worlds semantics in experiential semantics. We explain how an agent, experiencing a single stream of perceptions, may use this stream to construct an ensemble of possible worlds, which may then be used in various sorts of inferences. This may sound conceptually thorny, but on careful scrutiny it’s less so, and in fact it is closely related to a commonplace idea in the field of statistics: “subsampling.”

The basic idea of subsampling is that, if one has a single dataset D which one wishes to interpret as coming from a larger population of possible datasets, and one wishes to approximately understand the distribution of this larger population, then one can generate a set of additional datasets via removing various portions of D. Each time one removes a certain portion of D, one obtains another dataset, and one can then look at the distribution of these auxiliary datasets, considering it as a model of the population D is drawn from.

This notion ties in closely with the SRAM formal agents model of Chapter ?? of Part 1, which considers a probability distribution over a space of environments which are themselves probability distributions. What a real agent has is actually a single series of remembered observations. But it can induce a hopeful approximation of this distribution over environments by subsampling its memory and asking: “What would it imply about the world if the items in this subsample were the only things I’d seen?”

It may be conceptually useful to observe that a related notion to subsampling is found in the literary methodology of science fiction (SF). Many SF authors have followed the methodology of starting with our everyday world, and then changing one significant aspect, and depicting the world as they think it might exist if this one aspect were changed (or, a similar methodology may be followed via changing a small number of aspects). This is a way of generating a large variety of alternate possible worlds from the raw material of our own world.

Applied to SRAM, the subsampling and SF analogies suggest several methods of creating a possible world (and hence, by repetition, an ensemble of possible worlds) from the agent’s experience. An agent’s interaction sequence with its environment, ay_{<t} = ay_{1:t-1}, forms a sample from which it wishes to infer its environment µ(y_k | ay_{<k} a_k). To better assess this environment, the agent may, for example,

• create a possible world by removing a randomly selected collection of interactions from the agent’s memory. In this case, the agent’s interaction sequence would be of the form \tilde{I}_{g,s,t}(n_t) = ay(n_t), where (n_t) is some subsequence of 1:t-1.

• create a possible world via assuming a counterfactual hypothesis (i.e. assigning a statement a truth value that contradicts the agent’s experience), and using inference to construct a set of observations that is as similar to its memory as possible, subject to the constraint of being consistent with the hypothesis. The agent’s interaction sequence would then look like bz_{1:t-1}, where some collection of the b_k z_k differ from the corresponding a_k y_k.

• create a possible world by reorganizing portions of the interaction sequence.

• create a possible world by some combination of the above.

We denote an alteration of an interaction sequence I^a_{g,s,t} for an agent a by \tilde{I}^a_{g,s,t}, and the set of all such altered interaction sequences for agent a by \tilde{\mathcal{I}}^a.

In general, an agent’s interaction sequence will presumably be some reasonably likely sequence, and we would therefore be most interested in those cases for which d_I(I^a_{g,s,t}, \tilde{I}^a_{g,s,t}) is small, where d_I(·,·) is some measure of sequence similarity, such as neighborhood correlation or PSI-BLAST. The probability distribution ν over environments µ will then tend to give larger probabilities to nearby sequences, as measured by the chosen similarity measure, than to ones that are far away. In colloquial terms, an agent would typically be interested in considering only minor hypothetical changes to its interaction sequences, and would have little basis for understanding the consequences of drastic alterations.

Any of the above methods for altering interaction sequences would alter an agent’s perception sequence, causing changes to the fuzzy truth values mapped by the function F. This in turn would yield new probability distributions over the space of possible worlds, thereby yielding altered average probability values for the pair (F, T). This change, constructed from the perspective of the agent based on its experience, could then cause the agent to reassess its action a. Broadly speaking, we call these approaches “experiential possible worlds” or EPW.
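A minimal sketch of the first (subsampling) method, with a hypothetical list-of-interactions representation of the agent’s memory:

import random

def subsample_worlds(interactions, n_worlds=100, drop_rate=0.1, seed=0):
    # Create an ensemble of possible worlds by randomly deleting
    # interactions from the remembered sequence (EPW via subsampling).
    rng = random.Random(seed)
    return [[ay for ay in interactions if rng.random() > drop_rate]
            for _ in range(n_worlds)]

def epw_average(F, interactions, **kwargs):
    # Average a fuzzy predicate F over the induced ensemble of worlds,
    # giving an EPW-style estimate of F's probability.
    worlds = subsample_worlds(interactions, **kwargs)
    return sum(F(world) for world in worlds) / len(worlds)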

The creation of altered interaction sequences may, under appropriate assumptions, provide a basis for creating better estimates for the predicate F than we would otherwise have from a single real-world data point. More specifically, we have the following results.

Theorem 1. Let E_n represent an arbitrary ensemble of n agents chosen from A. Suppose that, on average over the set of agents a ∈ E_n, the set of values F(\tilde{I}) for mutated interaction sequences \tilde{I} is normal and unbiased, so that

E[F] = \frac{1}{n} \sum_{a \in E_n} \sum_{\tilde{I}^a_{g,s,t} \in \tilde{\mathcal{I}}^a} F(\tilde{I}^a_{g,s,t}) \, P(\tilde{I}^a_{g,s,t}).

Suppose further that these agents explore their environments by creating hypothetical worlds via altered interaction sequences. Then an unbiased estimate for E[F] is given by

\hat{F} = \frac{1}{n} \sum_{a \in E_n} \sum_{\tilde{I}^a_{g,s,t} \in \tilde{\mathcal{I}}^a} F(\tilde{I}^a_{g,s,t}) \, P(\tilde{I}^a_{g,s,t})
        = \frac{1}{n} \sum_{a \in E_n} \sum_{\tilde{I}^a_{g,s,t} \in \tilde{\mathcal{I}}^a} F(\tilde{I}^a_{g,s,t}) \sum_{e \in E} P(e \mid \tilde{I}^a_{g,s,t}) \, P(\tilde{I}^a_{g,s,t} \mid e).

Proof. That \hat{F} is an unbiased estimate for E[F] follows as a direct application of standard statistical bootstrapping theorems. See, for example, [?].

Theorem 2. Suppose that, in addition to the above assumptions, the predicate F is Lipschitz continuous as a function of the interaction sequences \tilde{I}^a_{g,s,t}. That is,

d_F\big(F(I^a_{g,s,t}), F(\tilde{I}^a_{g,s,t})\big) \le K \, d_I(I^a_{g,s,t}, \tilde{I}^a_{g,s,t})

for some bound K, where d_F(·,·) is a distance measure in predicate space. Then, setting both the bias correction and acceleration parameters to zero, the bootstrap BC_\alpha confidence interval for the mean of F satisfies

\hat{F}_{BC_\alpha}[\alpha] \subset \big[\hat{F} - K z^{(\alpha)} \sigma_I, \ \hat{F} + K z^{(\alpha)} \sigma_I\big]

where \sigma_I is the standard deviation of the altered interaction sequences and, letting \Phi denote the standard normal c.d.f., z^{(\alpha)} = \Phi^{-1}(\alpha).

Proof. Note that the Lipschitz condition gives

\sigma_F^2 = \frac{1}{n(|\tilde{\mathcal{I}}^a| - 1)} \sum_{a \in E_n} \sum_{\tilde{I}^a_{g,s,t} \in \tilde{\mathcal{I}}^a} d_F^2\big(F(I^a_{g,s,t}), F(\tilde{I}^a_{g,s,t})\big) P(\tilde{I}^a_{g,s,t})
          \le \frac{K^2}{n(|\tilde{\mathcal{I}}^a| - 1)} \sum_{a \in E_n} \sum_{\tilde{I}^a_{g,s,t} \in \tilde{\mathcal{I}}^a} d_I^2(I^a_{g,s,t}, \tilde{I}^a_{g,s,t}) P(\tilde{I}^a_{g,s,t})
          = K^2 \sigma_I^2.

Since the population is normal and the bias correction and acceleration parameters are both zero, the BC_\alpha bootstrap confidence interval reduces to the standard confidence interval, and the result then follows [?].

These two theorems together imply that, on average, through subsampling via altered interaction sequences, agents can obtain unbiased approximations to F and that, by keeping the deviations from their experienced interaction sequence small, the deviations of their approximations will also be small.

While the two theorems above demonstrate the power of our subsampling approach, the Lipschitz condition in Theorem 2 is a strong assumption. This observation motivates the following modification that is more in keeping with the flavor of PLN's indefinite probabilities approach.

Theorem 3. Define the set

$$I_{a;b} = \left\{ \tilde{I}_{a,g,s,t} \;\middle|\; d_F^2\!\left( F(I_{a,g,s,t}), F(\tilde{I}_{a,g,s,t}) \right) = b \right\},$$

and assume that for every real number $b$ the perceptions of the predicate $F$ satisfy

$$\frac{1}{n} \sum_{a \in E_n} P(I_{a;b}) \le \frac{M(b)}{b^2} \, \sigma_I^2$$

for some $M(b) \in \mathbb{R}$. Further suppose that

$$\int_0^1 M(b) \, db = M^2 \in \mathbb{R}.$$

Then under the same assumptions as in Theorem 1, and again setting both the bias correction and acceleration parameters to zero, we have

$$\bar{F}_{BC_\alpha}[\alpha] \subset \left[ \bar{F} - \frac{M}{\sqrt{n}} z^{(\alpha)} \sigma_I , \; \bar{F} + \frac{M}{\sqrt{n}} z^{(\alpha)} \sigma_I \right].$$

Proof.


G.4 Reinterpreting Indefinite Probabilities

In this scenario, one can view the second-order distribution as a distribution over all three courses of action that Fluffy might take. Each first-order distribution would then represent the probability distribution of the result from the corresponding action. By hypothetically considering all three possible courses of action and the probability distributions of the resulting actions, Fluffy can make more rational decisions even though no result is guaranteed.

G.4.1 Reinterpreting Indefinite Quantifiers

EPW also allows PLN's universal, existential and fuzzy quantifiers to be expressed in terms of implications on fuzzy sets. For example, if we have

ForAll $X
   Implication
      Evaluation F $X
      Evaluation G $X

then this is equivalent to

AverageQuantifier $X
   Implication
      Evaluation F* $X
      Evaluation G* $X

where e.g. F* is the fuzzy set of variations on F constructed by assuming possible errors in the historical evaluations of F. This formulation yields equivalent results to the one given in [?], but also has the property of reducing quantifiers to FOPLN (over sets derived from special predicates).

To fully understand the equivalence of the above two expressions, first note that in [?] we handle quantifiers by introducing third-order probabilities. As discussed there, the three levels of distributions are roughly as follows. The first- and second-order levels play the role, with some modifications, of standard indefinite probabilities. The third-order distribution then plays the role of "perturbing" the second-order distribution. The idea is that the second-order distribution represents the mean for the statement F(x). The third-order distribution then gives various values of F(x) for x, and the first-order distribution gives the sub-distributions for each of the second-order distributions. The final result is then found via an averaging process on all those second-order distributions that are "almost entirely" contained in some ForAll_proxy_interval.

Next, AverageQuantifier F($X) is a weighted average of F($X) over all relevant inputs $X; and we define the fuzzy set F* as the set of perturbations of a second-order distribution of hypotheses, and G* as the corresponding set of perturbed implication results. With these definitions, not only does the above equivalence follow naturally, so do the "possible/perturbed worlds" semantics for the ForAll quantifier. Other quantifiers, including fuzzy quantifiers, can be similarly recast.
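As a rough illustration of this reduction, the following Python sketch builds a fuzzy set of perturbed predicates and averages an implication strength over them. Everything here is an assumption made for illustration's sake: the additive perturbation model, the uniform weighting, and the Lukasiewicz-style implication strength min(1, 1 - a + b) are stand-ins for whatever perturbation scheme and PLN formula one actually uses.

import random

def perturbations(f, n=50, eps=0.05):
    # Fuzzy set F*: variations on f built by assuming small errors in its
    # historical evaluations; each variant shifts f by a fixed random offset.
    def make_variant():
        d = random.uniform(-eps, eps)
        return lambda x: min(1.0, max(0.0, f(x) + d))
    return [make_variant() for _ in range(n)]

def average_quantifier(f_star, g_star, inputs):
    # Average (here uniform) of the implication strength F*($X) ==> G*($X)
    # over perturbed predicate pairs and all relevant inputs $X.
    total, count = 0.0, 0
    for fp, gp in zip(f_star, g_star):
        for x in inputs:
            total += min(1.0, 1.0 - fp(x) + gp(x))
            count += 1
    return total / count

One would call average_quantifier(perturbations(F), perturbations(G), inputs) to approximate the AverageQuantifier expression above over the perturbed sets F* and G*.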

8/12/2019 Engineering General Intelligence Appendices B-H

http://slidepdf.com/reader/full/engineering-general-intelligence-appendices-b-h 94/104

88 G Possible Worlds Semantics and Experiential Semantics

G.5 Specifying Complexity for Intensional Inference

A classical dichotomy in logic involves the distinction between extensional inference (which involves sets with members) and intensional inference (which involves entities with properties). In PLN this is handled by taking extension as the foundation (where, in accordance with experiential semantics, sets ultimately boil down to sets of elementary observations), and defining intension in terms of certain fuzzy sets involving observation-sets. This means that in PLN intension, like higher-order inference, ultimately emerges as a subcase of FOPLN (though a subcase with special mathematical properties and special interest for cognitive science and AI). However, the prior formulation of PLN intension contains a "free parameter" (a complexity measure) which is conceptually inelegant; EPW remedies this by providing this parameter with a foundation in possible worlds semantics.

To illustrate how, in PLN, higher-order intensional inference reduces to first-order inferences, consider the case of intensional inheritance. IntensionalInheritance A B measures the extensional inheritance between the set of properties or patterns associated with A and the corresponding set associated with B. This concept is made precise via formally defining the concept of "pattern," founded on the concept of "association." We formally define the association operator ASSOC through:

ExtensionalEquivalence
   Member $E (ExOut ASSOC $C)
   ExOut
      Func
      List
         ExtensionalInheritance $E $C
         ExtensionalInheritance (NOT $E) $C

where Func(x, y) = [x − y]+ and [·]+ denotes the positive part.

We next define a pattern in an entity A as something that is associated with, but simpler than, A. Note that this definition presumes some measure c() of complexity. One can then define the fuzzy-set membership function called the "pattern-intensity" via

IN(F, G) = [c(G) − c(F)]+ [P(F|G) − P(F|¬G)]+,

measuring how much G is a pattern of F. The complexity measure c has been left unspecified in prior explications of PLN, but in the present context we may take it as the measure over concepts implied by the measure over possible worlds derived via subsampling or counterfactuals as described above.
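A literal rendering of the pattern-intensity formula over a finite universe of observations might look as follows in Python. The representation of predicates as callables (G crisp, F fuzzy) and the estimation of conditional probabilities by simple averaging are illustrative assumptions, not PLN's actual implementation.

def positive_part(x):
    return max(0.0, x)

def pattern_intensity(F, G, universe, c):
    # IN(F, G) = [c(G) - c(F)]+ * [P(F|G) - P(F|not G)]+, with the
    # conditional probabilities estimated by averaging the fuzzy value
    # of F over the members and non-members of G.
    g_members = [x for x in universe if G(x)]
    g_complement = [x for x in universe if not G(x)]
    p_f_given_g = (sum(F(x) for x in g_members) / len(g_members)
                   if g_members else 0.0)
    p_f_given_not_g = (sum(F(x) for x in g_complement) / len(g_complement)
                       if g_complement else 0.0)
    return (positive_part(c(G) - c(F))
            * positive_part(p_f_given_g - p_f_given_not_g))

Here c would be supplied as the complexity measure just discussed, e.g. one induced by the distribution over possible worlds.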

G.6 Reinterpreting Implication between Inheritance Relationships

Finally, one more place where possible worlds semantics plays a role in PLN is with implications such as

8/12/2019 Engineering General Intelligence Appendices B-H

http://slidepdf.com/reader/full/engineering-general-intelligence-appendices-b-h 95/104

G.7 Conclusion 89

Implication
   Inheritance Ben American
   Inheritance Ben obnoxious

We can interpret these by introducing predicates over possible worlds, so that e.g.

Z_Inheritance_Ben_American(W) tv

denotes that tv is the truth value of Inheritance_Ben_American in world W. A prerequisite for this, of course, is that Ben and American be defined in a way that spans the space of possible worlds in question. In the case of possible worlds defined by differing subsets of the same observation-set, this is straightforward; in the case of possible worlds defined via counterfactuals it is subtler and we will omit details here.

The above implication may then be interpreted as

AverageQuantifier $W
   Implication
      Evaluation Z_Inheritance_Ben_American $W
      Evaluation Z_Inheritance_Ben_obnoxious $W

The weighting over possible worlds $W may be taken as the one obtained by the system through the subsampling or counterfactual methods as indicated above.
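A minimal sketch of this interpretation, assuming each sampled world is indexed and assigned a weight by the subsampling process, and again using a Lukasiewicz-style strength min(1, 1 - a + b) as a stand-in for the actual PLN implication formula:

def implication_over_worlds(tv_antecedent, tv_consequent, weights):
    # AverageQuantifier over worlds $W: a weighted average of a per-world
    # implication strength between the truth values of the two
    # inheritance relationships.
    num = sum(w * min(1.0, 1.0 - a + b)
              for a, b, w in zip(tv_antecedent, tv_consequent, weights))
    return num / sum(weights)

# Three sampled worlds, weighted by the subsampling process:
print(implication_over_worlds([0.9, 0.8, 0.95], [0.7, 0.6, 0.8],
                              [0.5, 0.3, 0.2]))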

G.7 Conclusion

We began with the simple observation that the mind of an intelligent agent accumulates knowledge based on experience, yet also creates hypothetical knowledge about "the world as it might be," which is useful for guiding future actions. PLN handles this dichotomy via beginning from a foundation in experiential semantics, but then using a form of experientially-grounded possible-worlds semantics to ground a number of particular logical constructs, which we have reviewed here. The technical details we have provided illustrate the general thesis that a combination of experiential and possible-worlds notions may be the best approach to comprehending the semantics of declarative knowledge in generally intelligent agents.


Appendix H
Propositions About Environments in Which CogPrime Components Are Useful

H.1 Propositions about MOSES

Why is MOSES a good approach to automated program learning? The conceptual argument in favor of MOSES may be broken down into a series of propositions, which are given here both in informal "slogan" form and in semi-formalized "proposition" form.

Note that the arguments given here appear essentially applicable to other MOSES-related algorithms such as Pleasure as well. This material, however, was originally written in regard to MOSES and has not been revised in light of the creation of Pleasure.

Slogan 1 refers to "ENF", Elegant Normal Form, which is used by MOSES as a standard format for program trees. This is one way that MOSES differs from GP, for example: GP does not typically normalize program trees into a standard syntactic format, but leaves trees heterogeneous as to format.

H.1.1 Proposition: ENF Helps to Guide Syntax-Based Program Space Search

Slogan 1 Iterative optimization is guided based on syntactic distance ==> ENF is good

Proposition 1: On average, over a class C of fitness functions, it is better to do optimization based on a representation in which the (average over all functions in C of the) correlation between syntactic and semantic distance is larger. This should hold for any optimization algorithm which makes a series of guesses, in which the new guesses are chosen from the old ones in a way that is biased to choose new guesses that have small syntactic distance to the old one.

Note that GA, GP, BOA, BOAP and MOSES all fall into the specified category of optimization algorithms.

It is not clear what average smoothness condition is useful here. For instance, one could look at the average of d(f(x), f(y))/d(x, y) for d(x, y) < A, where d is syntactic distance and A is chosen so that the optimization algorithm is biased to choose new guesses that have syntactic distance less than A from the old ones.
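One way to probe such a condition empirically is sketched below: sample program pairs within syntactic radius A and estimate the correlation between their syntactic and semantic distances. The arguments programs, d_syn and d_sem are assumed to be supplied by the caller; this is only an illustrative sketch, not MOSES code.

import random
from statistics import correlation  # Python 3.10+

def local_correlation(programs, d_syn, d_sem, A, n_pairs=1000):
    # Estimate the syntactic/semantic correlation restricted to program
    # pairs whose syntactic distance is below the scale A.
    xs, ys = [], []
    for _ in range(20 * n_pairs):  # cap the attempts
        if len(xs) >= n_pairs:
            break
        p, q = random.sample(programs, 2)
        ds = d_syn(p, q)
        if ds < A:
            xs.append(ds)
            ys.append(d_sem(p, q))
    return correlation(xs, ys)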


H.1.2 Demes are Useful if Syntax/Semantics Correlations in Program Space Have a Small Scale

This proposition refers to the strategy of using "demes" in MOSES: instead of just evolving one population of program trees, a collection of "demes" is evolved, each one a population of program trees that are all somewhat similar to each other.

Slogan 2 Small-scale syntactic/semantic correlation ==> demes are good [If the maximal syntactic/semantic correlation occurs on a small scale, then multiple demes are useful]

Proposition 2: Let d denote syntactic distance, and d1 denote semantic distance. Suppose that the correlation between d(x,y) and d1(x,y) is much larger for d(x,y) < A than for A < d(x,y) < 2A or 2A < d(x,y), as an average across all fitness functions in class C. Suppose the number of spheres of radius R required to cover the space of all genotypes is n(R). Then using n(R) demes will provide significantly faster optimization than using n(2R) demes or 1 deme. Assume here the same conditions on the optimization algorithm as in Proposition 1.

Proposition 2.1: Consider the class of fitness functions defined by

Correlation(d(x,y), d1(x,y) | d(x,y) = a) = b

Then, across this class, there is a certain number D of demes that will be optimal on average; i.e., the optimal number of demes depends on the scale-dependence of the correlation between syntactic and semantic distance.

H.1.3 Probabilistic Program Tree Modeling Helps in the Presence of Cross-Modular Dependencies

This proposition refers to the use of BOA-type program tree modeling within MOSES. What it states is that this sort of modeling is useful if the programs in question have significant cross-modular dependencies that are not extremely difficult to detect.

Slogan 3 Cross-modular dependencies ==> BOA is good [If the genotypes possess significant internal dependencies that are not concordant with the genotypes' internal modular structure, then BOA-type optimization will significantly outperform GA/GP-type optimization for deme-exemplar extension.]

Proposition 3: Consider the classification problem of distinguishing fit genotypes from less fit genotypes, within a deme. If significantly greater classification accuracy can be obtained by classification rules containing "cross-terms" combining genotype elements that are distant from each other within the genotypes, and these cross-terms are not too large relative to the increase in accuracy they provide, then BOA-type modeling will significantly outperform GA/GP-type optimization.

The catch in Proposition 3 is that the BOA-type modeling must be sophisticated enough to recognize the specific cross-terms involved, of course.
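A crude way to probe for such cross-terms, sketched below under the assumption that genotypes are encoded as rows of a numeric matrix X with fitness values in a vector: compare the accuracy of a one-rule classifier with and without a product feature combining two distant positions. This is only an illustrative stand-in for BOA-style dependency modeling.

import numpy as np

def cross_term_gain(X, fitness, i, j):
    # Does adding the product feature X[:,i]*X[:,j] let a one-rule
    # classifier separate fit from unfit genotypes better than the
    # two positions taken alone?
    y = fitness > np.median(fitness)

    def best_single_feature_accuracy(features):
        best = 0.0
        for f in features.T:
            thr = np.median(f)
            acc = max(np.mean((f > thr) == y), np.mean((f <= thr) == y))
            best = max(best, acc)
        return best

    alone = best_single_feature_accuracy(X[:, [i, j]])
    with_cross = best_single_feature_accuracy(
        np.column_stack([X[:, i], X[:, j], X[:, i] * X[:, j]]))
    return with_cross - alone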

H.1.4 Relating ENF to BOA

Now, how does BOA learning relate to ENF?


Proposition 4: ENF decreases, on average, the number and size of cross-terms in the classification rules mentioned in Proposition 3.

H.1.5 Conclusion Regarding Speculative MOSES Theory

What we see from the above is that:

• ENF is needed in order to make the fitness landscape smoother, but can almost never work perfectly, so there will nearly always be some long-distance dependencies left after ENF-ization
• The smoother fitness landscape enabled by ENF enables optimization using demes and incremental exemplar-expansion to work, assuming the number of demes is chosen intelligently
• Within a deme, optimization via incremental exemplar growth is more efficient using BOA than straight evolutionary methods, due to the ability of BOA to exploit the long-distance dependencies not removed by ENF-ization

These propositions appear to capture the basic conceptual justification for the current MOSES methodology. Of course, proving them will be another story, and will likely involve making the proposition statements significantly more technical and complex.

Another interesting angle on these propositions is to view them as constraints on the problem type to which MOSES may be fruitfully applied. Obviously, no program learning algorithm can outperform random search on random program learning problems. MOSES, like any other algorithm, needs to be applied to problems that match its particular biases. What sorts of problems match MOSES's biases?

In particular, the right question to ask is: Given a particular choice regarding syntactic program representation, what sorts of problems match MOSES's biases as induced by this choice?

If the above propositions are correct, the answer is, basically: problems for which semantic distance (distance in fitness) is moderately well-correlated with syntactic distance (in the chosen representation) over a short scale but not necessarily over a long scale, and for which a significant percentage of successful programs have a moderate but not huge degree of internal complexity (as measured by internal cross-module dependencies).

Implicit in this is an explanation of why MOSES, on its own, is likely not a good approach to solving extremely large and complex problems. This is because for an extremely large and complex problem, the degree of internal complexity of successful programs will likely be too high for BOA modeling to cope with. So then, in these cases MOSES will effectively operate as a multi-start local search on normalized program trees, which is not a stupid thing, but unlikely to be adequately effective for most large, complex problems.

We see from the above that even in the case of MOSES, which is much simpler than OCP, formulating the appropriate theory adequately is not a simple thing, and proving the relevant propositions may be fairly difficult. However, we can also see from the MOSES example that the creation of a theoretical treatment does have some potential for clarifying the nature of the algorithm and its likely range of applicability.


H.2 Propositions About CogPrime

“experiential learning.” This proposition pertains to the conditions under which Hebbian-style,inductive PLN inference control can be useful.

Slogan 6 If similar theorems generally have similar proofs, then inductively-controlled PLN can work effectively

Proposition 6:

• Let L0 = a simple "base level" theorem-proving framework, with fixed control heuristics
• For n > 0, let Ln = theorem-proving done using Ln−1, with inference control done using data mining over a DB of inference trees, utilizing Ln−1 to find recurring patterns among these inference trees that are potentially useful for controlling inference

Then, if T is a set of theorems so that, within T, theorems that are similar according to "similarity provable in Ln−1 using effort E" have proofs that are similar according to the same measure, then Ln will be effective for proving theorems within T.
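The data-mining step in the Ln construction can be caricatured as follows in Python, under the assumption that past inference trees have been flattened into rule sequences; proof_db, its (theorem_features, rule_sequence) encoding, and the pairwise pattern statistic are all illustrative inventions, not PLN's actual control mechanism.

from collections import Counter

def mine_control_patterns(proof_db):
    # Mine a database of past inference trees (here flattened to rule
    # sequences) for frequently co-occurring consecutive rule pairs.
    counts = Counter()
    for _, rule_sequence in proof_db:
        for a, b in zip(rule_sequence, rule_sequence[1:]):
            counts[(a, b)] += 1
    return counts

def next_rule(counts, last_rule, candidates):
    # Prefer the candidate rule most often observed to follow last_rule.
    return max(candidates, key=lambda r: counts.get((last_rule, r), 0))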

H.2.3 Clustering-together of Smooth Theorems

This proposition is utilized within Proposition 8, below, which again has to do with PLN inference control.

Slogan 7 "Smooth" theorems tend to cluster together in theorem-space

Proposition 7: Define the smoothness of a theorem as the degree to which its proof is similar to the proofs of other theorems similar to it. Then smoothness varies smoothly in theorem-space; i.e., a smooth theorem tends to be close by to other smooth theorems.

H.2.4 When PLN is Useful Within MOSES

Above it was argued that PLN is useful within MOSES due to its capability to take account of history (across multiple fitness functions). But this is not the only reason to utilize PLN within MOSES; Propositions 6 and 7 above give us another theoretical reason.

Proposition 8: If similar theorems of the form "Program A is likely to have similar behavior to program B" tend to have similar proofs, and the conditions of Slogan 6 hold for the class of programs in question, then inductively controlled PLN is good (and better than BOA) for exemplar extension. (This is basically Proposition 6 + Proposition 7.)

H.2.5 When MOSES is Useful Within PLN

We have explored theoretical reasons why PLN should be useful within MOSES, as a replacement for the BOA step used in the standalone implementation of MOSES. The next few propositions work in the opposite direction, and explore reasons why MOSES should be useful within PLN, for the specific problem of finding elements of a set given a qualitative (intensional) description of a set. (This is not the only use of MOSES for helping PLN, but it is a key use and a fairly simple one to address from a theoretical perspective.)


Proposition 9: In a universe of sets where intensional similarity and extensional similarity are well-correlated, the problem of finding classification rules corresponding to a set S leads to a population of decently fit candidate solutions with high syntactic/semantic correlation, so that demes are good for this problem.

Proposition 10: In a universe of sets satisfying Proposition 9, where sets have properties with complex interdependencies, BOA will be useful for exemplar extension (in the context of using demes to find classification rules corresponding to sets).

Proposition 11: In a universe of sets satisfying Proposition 10, where the interdependencies associated with a set S's property-set vary "smoothly" as S varies, working inference is better than BOA for exemplar extension.

Proposition 12: In a universe of sets satisfying Proposition 10, where the proof of theorems of the form "Both the interdependencies of S's properties, and the interdependencies of T's properties, satisfy predicate F" depends smoothly on the theorem statement, inductively controlled PLN will be effective for exemplar extension.

H.2.6 On the Smoothness of Some Relevant Theorems

We have talked a bit about smooth theorems, but what sorts of theorems will tend to be smooth? If the OCP design is to work effectively, the "relevant" theorems must be smooth; and the following proposition gives some evidence as to why this may be the case.

Proposition 13: In a universe of sets where intensional similarity and extensional similarity are well-correlated, probabilistic theorems of the form "A is a probabilistic subset of B" and "A is a pattern in B" tend to be smooth.

Note that, for a set S of programs, to say "intensional similarity and extensional similarity are well-correlated" among subsets of S means the same thing as saying that syntactic and semantic similarity are well-correlated among members of S.

Proposition 14: The set of motor control programs, for a set of standard actuators like wheels, arms and legs, displays a reasonable level of correlation between syntactic and semantic similarity.

Proposition 15: The set of sentences that are legal in English displays a high level of correlation between syntactic and semantic similarity.

(The above is what, in Chaotic Logic [?], was called the "principle of continuous compositionality", extending Frege's Principle of Compositionality. It implies that language is learnable via OCP-type methods.... Unlike the other Propositions formulated here, it is more likely to be addressable via statistical than formal mathematical means; but insofar as English syntax can be formulated formally, it may be considered a roughly-stated mathematical proposition.)

H.2.7 Recursive Use of “MOSES with PLN” to Help With Attention Allocation

Proposition 16: The set of propositions of the form "When thinking about A is useful, thinking about B is often also useful" tends to be smooth, if "thinking" consists of MOSES plus inductively controlled PLN, and the universe of sets is such that this cognitive approach is generally a good one.

This (Prop. 16) implies that adaptive attention allocation can be useful for a MOSES+PLN system, if the attention allocation itself utilizes MOSES+PLN.

H.2.8 The Value of Conceptual Blending

Proposition 17: In a universe of sets where intensional similarity and extensional similarity are well-correlated, if two sets A and B are often useful in proving theorems of the form "C is a (probabilistic) subset of D", then "blends" of A and B will often be useful for proving such theorems as well.

This is a justification of conceptual blending for concept formation.

H.2.9 A Justification of Map Formation

Proposition 18: If a collection of terms A is often used together in MOSES+PLN, then similar collections B will often be useful as well, for this same process ... assuming the universe of sets is such that intensional and extensional similarity are correlated, and MOSES+PLN works well.

This is a partial justification of map formation, in that finding collections B similar to A is achieved by encapsulating A into a node A' and then doing reasoning on A'.

H.3 Concluding Remarks

The above set of propositions is certainly not complete. For instance, one might like to throw in conjunctive pattern mining as a rapid approximation to MOSES; and some specific justification of artificial economics as a path to effectively utilizing MOSES/PLN for attention allocation; etc.

But, overall, it seems fair to say that the above set of propositions smells like a possibly viable path to a theoretical justification of the OCP design.

To summarize the above ideas in a nutshell, we may say that the effectiveness of the OCP design appears intuitively to follow from the assumptions that:

• within the space of relevant learning problems, problems defined by similar predicates tend to have somewhat similar solutions

• according to OCP's knowledge representation, procedures and predicates with very similar behaviors often have very similar internal structures, and vice versa (and this holds to a drastically lesser degree if the "very" is removed)

• for relevant theorems ("theorems" meaning Atoms whose truth values need to be evaluated, or whose variables or SatisfyingSets need to be filled in, via PLN): similar theorems tend to have similar proofs, and the degree to which this holds varies smoothly in proof-space

• the world can be well modeled using sets for which intensional and extensional similarity are well correlated: meaning that the mind can come up with a system of "extensional categories" useful for describing the world, and displaying characteristic patterns that are not too complex to be recognized by the mind's cognitive methods

To really make use of this sort of theory, of course, two things would need to be done. For one thing, the propositions would have to be proved (which will probably involve some serious adjustments to the proposition statements). For another thing, some detailed argumentation would have to be done regarding why the "relevant problems" confronting an embodied AGI system actually fulfill the assumptions. This might turn out to be the hard part, because the class of "relevant problems" is not so precisely defined. For very specific problems, however (to name some examples quasi-randomly: natural language learning, object recognition, learning to navigate in a room with obstacles, or theorem-proving within a certain defined scope), it may be possible to make detailed arguments as to why the assumptions should be fulfilled.

Recall that what makes OCP different from huge-resources AI designs like AIXI (including AIXItl) and the Gödel Machine is that it involves a number of specialized components, each with their own domains and biases and some with truly general potential as well, hooked together in an integrative architecture designed to foster cross-component interaction and overall synergy and emergence. The strength and weakness of this kind of architecture is that it is specialized to a particular class of environments. AIXItl and the Gödel Machine can handle any type of environment roughly equally well (which is: very, very slowly), whereas CogPrime has the potential to be much faster when it is in an environment that poses it learning problems that match its particular specializations. What we have done in the above series of propositions is to partially formalize the properties an environment must have to be "CogPrime-friendly." If the propositions are essentially correct, and if interesting real-world environments largely satisfy their assumptions, then OCP is a viable AGI design.