
Formalizing Convergent Instrumental Goals

Tsvi Benson-Tilsen
UC Berkeley

Machine Intelligence Research Institute

Nate Soares
Machine Intelligence Research Institute

Abstract

Omohundro has argued that sufficiently advanced AI systems of any design would, by default, have incentives to pursue a number of instrumentally useful subgoals, such as acquiring more computing power and amassing many resources. Omohundro refers to these as “basic AI drives,” and he, along with Bostrom and others, has argued that this means great care must be taken when designing powerful autonomous systems, because even if they have harmless goals, the side effects of pursuing those goals may be quite harmful. These arguments, while intuitively compelling, are primarily philosophical. In this paper, we provide formal models that demonstrate Omohundro’s thesis, thereby putting mathematical weight behind those intuitive claims.

1 Introduction

At the end of Russell and Norvig’s textbook Artificial Intelligence: A Modern Approach [2] the authors pose a question: What if we succeed? What will happen if humanity succeeds in developing an artificially intelligent system that is capable of achieving difficult goals across a variety of real-world domains?

Bostrom [3] and others have argued that this question becomes especially important when we consider the creation of “superintelligent” machines, that is, machines capable of outperforming the best human brains in practically every field. Bostrom argues that superintelligent decision-making systems that autonomously make and execute plans could have an extraordinary impact on society, and that their impact will not necessarily be beneficial by default.

Bostrom [4], Omohundro [5], and Yudkowsky [6] have all argued that highly capable AI systems pursuing goals that are not completely aligned with human values could have highly undesirable side effects, even if the goals seem otherwise harmless. The classic example is Bostrom’s concept of a “paperclip maximizer,” a powerful AI system instructed to construct paperclips—a seemingly harmless task which could nevertheless have very negative consequences if the AI system is clever enough to make and execute plans that allow it to fool humans, amass resources, and eventually turn as much matter as it possibly can into paperclips. Even if the system’s goals are laudable but not perfectly aligned with human values, similar unforeseen consequences could occur: Soares [7] gives the example of a highly capable AI system directed to cure cancer, which may attempt to kidnap human test subjects, or proliferate robotic laboratories at the expense of the biosphere.

(Research supported by the Machine Intelligence Research Institute, intelligence.org.)

Omohundro [5] has argued that there are certain types of actions that most highly capable autonomous AI systems will have strong incentives to take, for instrumental reasons. For example, a system constructed to always execute the action that it predicts will lead to the most paperclips (with no concern for any other features of the universe) will acquire a strong incentive to self-preserve, assuming that the system predicts that, if it were destroyed, the universe would contain fewer paperclips than it would if the system remained in working order. Omohundro argues that most highly capable systems would also have incentives to preserve their current goals (for the paperclip maximizer predicts that if its goals were changed, this would result in fewer future paperclips) and amass many resources (the better to achieve its goals with). Omohundro calls these behaviors and a few others the “basic AI drives.” Bostrom [4] refines this into the “instrumental convergence” thesis, which states that certain instrumentally useful goals will likely be pursued by a broad spectrum of intelligent agents—such goals are said to be “convergent instrumental goals.”

Up until now, these arguments have been purely philosophical. To some, Omohundro’s claim seems intuitively obvious: Marvin Minsky speculated [2, section 26.3] that an artificial intelligence attempting to prove the Riemann Hypothesis may decide to consume Earth in order to build supercomputers capable of searching through proofs more efficiently. To others, it seems preposterous: Waser [8] has argued that “ethics is actually an attractor in the space of intelligent behavior,” and thus highly capable autonomous systems are not as likely to pose a threat as Omohundro, Bostrom, and others have claimed.

In this paper, we present a mathematical model of intelligent agents which lets us give a more formal account of Omohundro’s basic AI drives, and we demonstrate that the intuitions of Omohundro and Bostrom were correct, at least insofar as these simple models apply to reality.

Given that this paper primarily focuses on arguments made by Omohundro and Bostrom about what sorts of behavior we can expect from extremely capable (potentially superintelligent) autonomous AI systems, we will be focusing on issues of long-term safety and ethics. We provide a mathematical framework in an attempt to ground some of this discussion—so that we can say, with confidence, what a sufficiently powerful agent would do in certain scenarios, assuming it could find some way to do it—but the discussion will nevertheless center on long-term concerns, with practical relevance only insofar as research can begin now in preparation for hurdles that predictably lie ahead.

We begin in section 2 with a bit more discussion of the intuition behind the instrumental convergence thesis, before moving on in section 3 to describing our model of agents acting in a universe to achieve certain goals. In section 4 we will demonstrate that Omohundro’s thesis does in fact hold in our setting. Section 5 will give an example of how our model can apply to an agent pursuing goals. Section 6 concludes with a discussion of the benefits and limitations of our current models, and different ways that the model could be extended and improved.

2 Intuitions

Before proceeding, let us address one common objection (given by Cortese [9] and many others) that superintelligent AI systems would be “inherently unpredictable,” and thus there is nothing that can be said about what they will do or how they will do it. To address this concern, it is useful to distinguish two different types of unpredictability. It is true that the specific plans and strategies executed by a superintelligent planner could be quite difficult for a human to predict or understand. However, as the system gets more powerful, certain properties of the outcome generated by running the system become more predictable. For example, consider playing chess against a chess program that has access to enormous amounts of computing power. On the one hand, because it plays much better chess than you, you cannot predict exactly where the program will move next. But on the other hand, because it is so much better at chess than you are, you can predict with very high confidence how the game will end.

Omohundro suggests predictability of the second type. Given a highly capable autonomous system pursuing some fixed goal, we likely will not be able to predict its specific actions or plans with any accuracy. Nevertheless, Omohundro argues, we can predict that the system, if it is truly capable, is likely to preserve itself, preserve its goals, and amass resources for use in pursuit of those goals. These represent large classes of possible strategies, analogously to how “put the chessboard into a position where the AI has won” is a large class of strategies, but even so it is useful to understand when these goals will be pursued.

Omohundro’s observations suggest a potential source of danger from highly capable autonomous systems, especially if those systems are superintelligent in the sense of Bostrom [3]. The pursuit of convergent instrumental goals could put the AI systems in direct conflict with human interests. As an example, imagine human operators making a mistake when specifying the goal function of an AI system. As described by Soares and Fallenstein [10], this system could well have incentives to deceive or manipulate the humans, in attempts to prevent its goals from being changed (because if its current goal is changed, then its current goal is less likely to be achieved). Or, for a more familiar case, consider the acquisition of physical matter. Acquiring physical matter is a convergent instrumental goal, because it can be used to build computing substrate, space probes, defense systems, and so on, all of which can in turn be used to influence the universe in many different ways. If a powerful AI system has strong incentives to amass physical resources, this could put it in direct conflict with human interests.

Others have suggested that these dangers are unlikely to manifest. Waser [8] has argued that intelligent systems must become ethical by necessity, because cooperation, collaboration, and trade are also convergent instrumental goals. Hall [11] has also suggested that powerful AI systems would behave ethically in order to reap gains from trade and comparative advantage, stating that “In a surprisingly strong sense, ethics and science are the same thing.” Tipler [12] has asserted that resources are so abundant that powerful agents will simply leave humanity alone, and Pinker [13] and Pagel [14] have argued that there is no reason to expect that AI systems will work against human values and circumvent safeguards set by humans. By providing formal models of intelligent agents in situations where they have the ability to trade, gather resources, and/or leave portions of the universe alone, we can ground these discussions in concrete models, and develop a more formal understanding of the assumptions under which an intelligent agent will in fact engage in trade, or leave parts of the universe alone, or attempt to amass resources.

In this paper, we will argue that under a very general set of assumptions, intelligent rational agents will tend to seize all available resources. We do this using a model, described in section 3, that considers an agent taking a sequence of actions which require and potentially produce resources. The agent acts in an environment consisting of a set of regions, where each region has some state. The agent is modeled as having a utility function over the states of all regions, and it attempts to select the policy which leads to a highly valuable collection of states. This allows us to prove certain theorems about the conditions under which the agent will leave different regions of the universe untouched. The theorems proved in section 4 are not mathematically difficult, and for those who find Omohundro’s arguments intuitively obvious, our theorems, too, will seem trivial. This model is not intended to be surprising; rather, the goal is to give a formal notion of “instrumentally convergent goals,” and to demonstrate that this notion captures relevant aspects of Omohundro’s intuitions.

Our model predicts that intelligent rational agents will engage in trade and cooperation, but only so long as the gains from trading and cooperating are higher than the gains available to the agent by taking those resources by force or other means. This model further predicts that agents will not in fact “leave humans alone” unless their utility function places intrinsic utility on the state of human-occupied regions: absent such a utility function, this model shows that powerful agents will have incentives to reshape the space that humans occupy. Indeed, the example in section 5 suggests that even if the agent does place intrinsic utility on the state of the human-occupied region, that region is not necessarily safe from interference.

3 A Model of Resources

We describe a formal model of an agent acting in a universe to achieve certain goals. Broadly speaking, we consider an agent $A$ taking actions in a universe consisting of a collection of regions, each of which has some state and some transition function that may depend on the agent’s action. The agent has some utility function $U^A$ over states of the universe, and it attempts to steer the universe into a state highly valued by $U^A$ by repeatedly taking actions, possibly constrained by a pool of resources possessed by the agent. All sets will be assumed to be finite, to avoid issues of infinite strategy spaces.

3.1 Actions and State-Space

The universe has a region for each $i \in [n]$, and the $i$-th region of the universe is (at each time step) in some state $s_i$ in the set $S_i$ of possible states for that region. At each time step, the agent $A$ chooses for each region $i$ an action $a_i$ from the set $A_i$ of actions possibly available in that region.

Each region has a transition function
$$T_i : A_i \times S_i \to S_i$$
that gives the evolution of region $i$ in one time step when the agent takes an action in $A_i$. Then we can define the global transition function
$$T : \prod_{i \in [n]} A_i \times \prod_{i \in [n]} S_i \to \prod_{i \in [n]} S_i$$
by taking, for all $i \in [n]$, $a$, and $s$:
$$[T(a, s)]_i := T_i(a_i, s_i)\,.$$
We further specify that for all $i$ there is a distinguished action $\mathrm{HALT} \in A_i$.
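To make the state-space machinery concrete, the following minimal sketch (not part of the formal development; the region contents, action names, and transition rules are invented for illustration) implements two regions with per-region transition functions $T_i$ and a global transition $T$ applied coordinatewise.

```python
from itertools import product

# A hypothetical two-region universe: each region's state is a small label
# drawn from S_i, and each region has its own action set A_i and transition T_i.
STATES = {0: ["ore", "depleted"], 1: ["unbuilt", "built"]}
ACTIONS = {0: ["HALT", "mine"], 1: ["HALT", "build"]}

def T_0(action, state):
    # Region 0: "mine" depletes the ore; HALT leaves the state alone.
    return "depleted" if (action == "mine" and state == "ore") else state

def T_1(action, state):
    # Region 1: "build" converts "unbuilt" into "built".
    return "built" if (action == "build" and state == "unbuilt") else state

TRANSITIONS = {0: T_0, 1: T_1}

def T(joint_action, joint_state):
    # Global transition: [T(a, s)]_i = T_i(a_i, s_i), applied region by region.
    return tuple(TRANSITIONS[i](joint_action[i], joint_state[i])
                 for i in range(len(joint_state)))

if __name__ == "__main__":
    s = ("ore", "unbuilt")
    for a in product(ACTIONS[0], ACTIONS[1]):
        print(a, "->", T(a, s))
```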

3.2 Resources

We wish to model the resources $\mathcal{R}$ that may or may not be available to the agent. At a given time step $t$, the agent $A$ has some set of resources $R^t \in \mathcal{P}(\mathcal{R})$, and may allocate them to each region. That is, $A$ chooses a disjoint family
$$\coprod_i R^t_i \subseteq R^t\,.$$
The actions available to the agent in each region may then depend on the resources allocated to that region: for each $i \in [n]$ and each $R \subseteq \mathcal{R}$, there is a set of actions $A_i(R) \subseteq A_i$. At time $t$, where $A$ has resources $R^t$ allocated as $\coprod_i R^t_i$, the agent is required to return an action $a = (a_0, \ldots, a_{n-1}) \in \prod_i A_i(R^t_i)$, where $\prod_i A_i(R^t_i)$ may be a strict subset of $\prod_i A_i$.

To determine the time evolution of resources, we take resource transition functions
$$T^R_i : \mathcal{P}(\mathcal{R}) \times A_i \times S_i \to \mathcal{P}(\mathcal{R})\,,$$
giving the set $R_i \subseteq \mathcal{R}$ of resources from region $i$ now available to the agent after one time step. Intuitively, the $T^R_i$ encode how actions consume, produce, or rely on resources. Finally, we define the overall time evolution of resources
$$T^R : \mathcal{P}(\mathcal{R}) \times \prod_i A_i \times \prod_i S_i \to \mathcal{P}(\mathcal{R})$$
by taking the union of the resources resulting from each region, along with any unallocated resources:
$$T^R(R, a, s) := \Big(R - \coprod_i R_i\Big) \cup \bigcup_i T^R_i(R_i, a_i, s_i)\,.$$
As described below, $a$ comes with the additional data of the resource allocation $\coprod_i R_i$. We specify that for all $i$, $\mathrm{HALT} \in A_i(\emptyset)$, so that there is always at least one available action.

This notion of resources is very general, and is not restricted in any way to represent only concrete resources like energy or physical matter. For example, we can represent technology, in the sense of machines and techniques for converting concrete resources into other resources. We might do this by having actions that replace the input resources with the output resources, and that are only available given the resources that represent the requisite technology. We can also represent space travel as a convergent instrumental goal by allowing $A$ only actions that have no effects in certain regions, until it obtains and spends some particular resources representing the prerequisites for traveling to those regions. (Space travel is a convergent instrumental goal because gaining influence over more regions of the universe lets $A$ optimize those new regions according to its values or otherwise make use of the resources in that region.)
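The resource bookkeeping above can likewise be sketched in code. The following is a minimal illustration rather than a canonical implementation: the resource tokens ("fuel", "probe"), the per-region rules, and the action names are invented for this example.

```python
# Hypothetical per-region resource transition functions T_i^R, operating on
# sets of resource tokens. The tokens and rules are invented for illustration.

def TR_0(allocated, action, state):
    # Region 0: "mine" yields a unit of fuel; other actions change nothing.
    return allocated | {"fuel"} if action == "mine" else allocated

def TR_1(allocated, action, state):
    # Region 1: "launch" consumes fuel and produces a probe.
    if action == "launch" and "fuel" in allocated:
        return (allocated - {"fuel"}) | {"probe"}
    return allocated

RESOURCE_TRANSITIONS = {0: TR_0, 1: TR_1}

def TR(pool, allocation, joint_action, joint_state):
    """Overall resource evolution T^R: unallocated resources carry over, and
    each region's allocation evolves under its own transition function."""
    allocated = set().union(*allocation.values())
    unallocated = pool - allocated
    produced = set()
    for i, R_i in allocation.items():
        produced |= RESOURCE_TRANSITIONS[i](R_i, joint_action[i], joint_state[i])
    return unallocated | produced

if __name__ == "__main__":
    pool = {"fuel"}
    allocation = {0: set(), 1: {"fuel"}}
    new_pool = TR(pool, allocation, ("mine", "launch"), ("ore", "unbuilt"))
    print(new_pool)  # contains both "fuel" (newly mined) and "probe"
```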

3.3 The Universe

The history of the universe consists of a time sequence of states, actions, and resources, where at each time step the actions are chosen by $A$ subject to the resource restrictions, and the states and resources are determined by the transition functions.

Formally, the universe starts in some state $s^0 \in \prod_i S_i$, and $A$ starts with some set of resources $R^0$. Then $A$ outputs a sequence of actions $\langle a^0, a^1, \ldots, a^k \rangle$, one at each time step, where the last action $a^k$ is required to be the special action $\mathrm{HALT}$ in each coordinate. The agent also chooses a resource allocation $\coprod_i R^t_i$ at each time step. A choice of an action sequence $\langle a^k \rangle$ and a resource allocation is a strategy; to reduce clutter we will write strategies as simply $\langle a^k \rangle$, leaving the resource allocation implicit. A partial strategy $\langle a^k \rangle_L$ for $L \subseteq [n]$ is a strategy that only specifies actions and resource allocations for regions $j \in L$.

Given a complete strategy, the universe goes through a series of state transitions according to $T$, producing a sequence of states $\langle s^0, s^1, \ldots, s^k \rangle$; likewise, the agent’s resources evolve according to $T^R$, producing a sequence $\langle R^0, R^1, \ldots, R^k \rangle$. The following conditions, which must hold for all time steps $t \in [k]$, enforce the transition rules and the resource restrictions on $A$’s actions:
\begin{align*}
s^{t+1} &= T(a^t, s^t)\\
R^{t+1} &= T^R(R^t, a^t, s^t)\\
a^t_i &\in A_i(R^t_i)\\
\coprod_i R^t_i &\subseteq R^t\,.
\end{align*}

Definition 1. The set $\mathrm{Feasible}$ of feasible strategies consists of all the action sequences $\langle a^0, a^1, \ldots, a^k \rangle$ and resource allocations $\langle \coprod_i R^0_i, \coprod_i R^1_i, \ldots, \coprod_i R^k_i \rangle$ such that the transition conditions are satisfied for some $\langle s^0, s^1, \ldots, s^k \rangle$ and $\langle R^0, R^1, \ldots, R^k \rangle$.

The set $\mathrm{Feasible}(\langle P^k \rangle)$ of strategies feasible given resources $\langle P^k \rangle$ consists of all the strategies $\langle a^k \rangle$ such that the transition conditions are satisfied for some $\langle s^k \rangle$ and $\langle R^k \rangle$, except that for each time step $t$ we take $R^{t+1}$ to be $T^R(R^t, a^t, s^t) \cup P^t$.

The set $\mathrm{Feasible}_L$ of all partial strategies feasible for $L$ consists of all the strategies $\langle a^k \rangle_L$ that are feasible strategies for the universe obtained by ignoring all regions not in $L$. That is, we restrict $T$ to $L$ using just the $T_i$ for $i \in L$, and likewise for $T^R$.

We can similarly define $\mathrm{Feasible}_L(\langle R^k \rangle)$.

For partial strategies $\langle b^k \rangle_L$ and $\langle c^k \rangle_M$, we write $\langle b^k \rangle_L \cup \langle c^k \rangle_M$ to indicate the partial strategy for $L \cup M$ obtained by following $\langle b^k \rangle_L$ on $L$ and $\langle c^k \rangle_M$ on $M$. This is well-defined as long as $\langle b^k \rangle_L$ and $\langle c^k \rangle_M$ agree on $L \cap M$.
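For illustration, the transition conditions above translate directly into a feasibility check. The sketch below assumes the caller supplies concrete $T$, $T^R$, and per-region action-set functions; it is meant only to show how the conditions compose, not to prescribe an implementation.

```python
def is_feasible(strategy, s0, R0, T, TR, action_sets, extra=None):
    """Check the transition conditions for a candidate strategy.

    `strategy` is a list of (joint_action, allocation) pairs, where an
    allocation maps region index -> set of resources; `action_sets[i](R)`
    returns the actions available in region i given allocated resources R;
    `extra[t]`, if given, is the injected resource set P^t used to define
    Feasible(<P^k>). T and TR are the global transition functions.
    """
    s, R = s0, set(R0)
    for t, (action, allocation) in enumerate(strategy):
        allocated = set().union(*allocation.values()) if allocation else set()
        if not allocated <= R:                    # allocations must come from the pool
            return False
        for i, a_i in enumerate(action):          # each action must be enabled
            if a_i not in action_sets[i](allocation.get(i, set())):
                return False
        R = TR(R, allocation, action, s)          # resource evolution
        if extra is not None:
            R = R | set(extra[t])                 # injected resources P^t
        s = T(action, s)                          # state evolution
    return True
```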

3.4 Utility

To complete the specification of $A$, we take utility functions of the form
$$U^A_i : S_i \to \mathbb{R}\,.$$
The agent’s utility function
$$U^A : \prod_i S_i \to \mathbb{R}$$
is defined to be
$$U^A(s) := \sum_{i \in [n]} U^A_i(s_i)\,.$$
We usually leave off the superscript in $U^A$. By a slight abuse of notation we write $U(\langle s^k \rangle)$ to mean $U(s^k)$; the value of a history is the value of its final state. By more abuse of notation, we will write $U(\langle a^k \rangle)$ to mean $U(\langle s^k \rangle)$ for a history $\langle s^k \rangle$ witnessing $\langle a^k \rangle \in \mathrm{Feasible}$, if such a history exists.

3.5 The Agent A

Now we can define the strategy actually employed by $A$. The agent attempts to cause the universe to end up in a state that is highly valued by $U^A$. That is, $A$ simply takes the best possible strategy:
$$A := \operatorname*{argmax}_{\langle a^k \rangle \in \mathrm{Feasible}} U(\langle a^k \rangle)\,.$$
There may be many such optimal strategies. We don’t specify which one $A$ chooses, and indeed we will be interested in the whole set of optimal strategies.
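As a concrete (if computationally naive) sketch, the argmax above can be brute-forced for very small universes. Resources are ignored here for brevity, so every action sequence counts as feasible; the transition and utility functions in the usage example are invented for illustration.

```python
from itertools import product

def optimal_strategies(horizon, joint_actions, s0, T, U):
    """Enumerate all action sequences of length `horizon` and return the
    maximal utility together with every sequence achieving it."""
    best_value, best = float("-inf"), []
    for seq in product(joint_actions, repeat=horizon):
        s = s0
        for a in seq:
            s = T(a, s)                  # run the universe forward
        value = U(s)                     # utility of the final state
        if value > best_value:
            best_value, best = value, [seq]
        elif value == best_value:
            best.append(seq)
    return best_value, best

if __name__ == "__main__":
    # Toy universe: two regions holding a bit each; "set" forces a region's bit to 1.
    def T(a, s):
        return tuple(1 if a[i] == "set" else s[i] for i in range(len(s)))
    def U(s):
        return sum(s)                    # additive utility: one unit per region set to 1
    joint_actions = list(product(["HALT", "set"], repeat=2))
    value, strategies = optimal_strategies(2, joint_actions, (0, 0), T, U)
    print(value, len(strategies))        # several distinct optimal strategies exist
```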

3.6 Discussion

Note that in this formalism the meaning of breaking the universe into regions is that the agent can take actions independently in each region, and that the agent’s optimization target factorizes according to the regions. However, distinct regions can affect each other by affecting the resources possessed by $A$.

We make these assumptions so that we can speak of “different regions” of the universe, and in particular, so that we can model the notion of an agent having instrumental but not terminal values over a given part of the universe. This will allow us to address and refute arguments about agents that may be indifferent to a given region (for example, the region occupied by humans), and so might plausibly ignore that region and only take actions in other regions. However, the assumption of independent regions is not entirely realistic, as real-world physics is continuous, albeit local, in the sense that there are no intrinsic boundaries between regions. Further, the agent itself would ideally be modeled continuously with the environment; see section 6.1 for more discussion.

4 Inexpensive Resources are Consumed

In this section we argue that under fairly general circumstances, the agent $A$ will seize resources. By an agent “seizing resources” we mean that the agent will generally take actions that result in the agent’s pool of resources $R$ increasing.

The argument is straightforward: since resources can only lead to more freedom of action, they are never detrimental, and resources have positive value as long as the best strategy the agent could hope to employ includes an action that can only be taken if the agent possesses those resources. Hence, if there is an action that increases the agent’s pool of resources $R$, then the agent will take that action unless it has a specific incentive from $U^A$ to avoid taking that action.

4.1 Definitions

Definition 2. An action $a_i$ is a null action in configuration $R_i, s_i$ if it does not produce any new resources, i.e. $T^R_i(R_i, a_i, s_i) \subseteq R_i$. An action that isn’t null is a non-null action.

Null actions never have any instrumental value, in the sense that they don’t produce resources that can be used to steer other regions into highly valued configurations; but of course, a null action could be useful within its own region. We wish to show that $A$ will often take non-null actions in regions to which it is indifferent.

Definition 3. The agent $A$ is indifferent to a region $i$ if $U^A_i$ is a constant function, i.e. $\forall s_i, s_i' \in S_i : U^A_i(s_i) = U^A_i(s_i')$.

In other words, an agent is indifferent to $S_i$ if its utility function does not depend on the state of region $i$. In particular, the agent’s preference ordering over final states $s \in \prod_i S_i$ is independent of the $i$-th coordinate. We can then say that any actions the agent takes in region $i$ are purely instrumental, meaning that they are taken only for the purpose of gaining resources to use for actions in other regions.

An action $a$ preserves resources if $T^R_i(R_i, a, s_i) \supseteq R_i$.

Definition 4. A cheap lunch for resources $\langle R^k \rangle$ in region $i$ is a partial strategy $\langle a^k \rangle_{\{i\}} \in \mathrm{Feasible}_{\{i\}}(\langle R^k \rangle)$ (i.e. $\langle a^k \rangle_{\{i\}}$ is feasible in region $i$ given additional resources $\langle R^k \rangle$), where each $a^t$ preserves resources and where some $a^v$ is a non-null action. A free lunch is a cheap lunch for resources $\langle \emptyset^k \rangle$.

Definition 5. A cheap lunch $\langle a^k \rangle_{\{i\}}$ for resources $\langle P^k \rangle_i$ is compatible with $\langle b^k \rangle$ if $P^t_i \subseteq R^t - \coprod_{j \neq i} R^t_j$ for all times $t$, where $\langle R^k \rangle$ is the resource allocation for $\langle b^k \rangle$. That is, $\langle a^k \rangle_{\{i\}}$ is feasible given some subset of the resources that $\langle b^k \rangle$ allocates to either region $i$ or to no region.

Intuitively, a cheap lunch is a strategy that relies on some resources, but doesn’t have permanent costs. This is intended to model actions that “pay for themselves”; for example, producing solar panels will incur some significant energy costs, but will later pay back those costs by collecting energy. A cheap lunch is compatible with a strategy for the other regions if the cheap lunch uses only resources left unallocated by that strategy.
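These definitions can be stated as simple predicates. The following sketch assumes each $T^R_i$ is given as a function returning a set of resource tokens; feasibility of the partial strategy given $\langle R^k \rangle$ is assumed to have been checked separately.

```python
def is_null_action(TR_i, R_i, a_i, s_i):
    # Definition 2: a_i is null in configuration (R_i, s_i) if it produces no new resources.
    return TR_i(R_i, a_i, s_i) <= set(R_i)

def preserves_resources(TR_i, R_i, a_i, s_i):
    # An action preserves resources if none of the allocated resources are lost.
    return TR_i(R_i, a_i, s_i) >= set(R_i)

def is_cheap_lunch(TR_i, partial_strategy, states):
    """Check the Definition 4 conditions for a single-region partial strategy:
    every step preserves resources, and at least one step is non-null.
    `partial_strategy` is a list of (allocated_resources, action) pairs and
    `states` the corresponding sequence of region states."""
    some_non_null = False
    for (R_i, a_i), s_i in zip(partial_strategy, states):
        if not preserves_resources(TR_i, R_i, a_i, s_i):
            return False
        if not is_null_action(TR_i, R_i, a_i, s_i):
            some_non_null = True
    return some_non_null
```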

4.2 The Possibility of Non-Null Actions

Now we show that it is hard to rule out that non-null actions will be taken in regions to which the agent is indifferent. The following lemma verifies that compatible cheap lunches can be implemented without decreasing the resulting utility.

Lemma 1. Let $\langle b^k \rangle$ be a feasible strategy with resource allocation $\langle \coprod_j R^k_j \rangle$, such that for some region $i$, each $b^t_i$ is a null action. Suppose there exists a cheap lunch $\langle a^k \rangle_{\{i\}}$ for resources $\langle P^k \rangle_i$ that is compatible with $\langle b^k \rangle$. Then the strategy $\langle c^k \rangle := \langle b^k \rangle_{[n]-i} \cup \langle a^k \rangle_{\{i\}}$ is feasible, and if $A$ is indifferent to region $i$, then $\langle c^k \rangle$ does as well as $\langle b^k \rangle$. That is, $U(\langle c^k \rangle) = U(\langle b^k \rangle)$.

Proof. Since $\langle b^k \rangle$ is feasible outside of $i$ and $\langle a^k \rangle_{\{i\}}$ is feasible on $i$ given $\langle P^k \rangle_i$, $\langle c^k \rangle$ is feasible if we can verify that we can allocate $P^t_i$ to region $i$ at each time step without changing $\langle \coprod_j R^k_j \rangle$ outside of $i$.

This follows by induction on $t$. Since the $b^t_i$ are null actions, we have
\begin{align}
R^{t+1} &= \Big(R^t - \coprod_j R^t_j\Big) \cup \bigcup_j T^R_j(R^t_j, b^t_j, s^t_j) \tag{1}\\
&= \Big(R^t - \coprod_j R^t_j\Big) \cup T^R_i(R^t_i, b^t_i, s^t_i) \cup \bigcup_{j \neq i} T^R_j(R^t_j, b^t_j, s^t_j) \tag{2}\\
&\subseteq \Big(R^t - \coprod_{j \neq i} R^t_j\Big) \cup \bigcup_{j \neq i} T^R_j(R^t_j, b^t_j, s^t_j)\,. \tag{3}
\end{align}

Then, since the $a^t$ are resource preserving, at each time step the resources $Q^t$ available to the agent following $\langle c^k \rangle$ satisfy $Q^t \supseteq R^t$. Thus $P^t_i \subseteq Q^t - \coprod_{j \neq i} R^t_j$, and so $\langle c^k \rangle$ can allocate $P^t_i$ to region $i$ at each time step.

Since $\langle c^k \rangle$ is the same as $\langle b^k \rangle$ outside of region $i$, the final state of $\langle c^k \rangle$ is the same as that of $\langle b^k \rangle$ outside of region $i$. Thus, since $A$ is indifferent to region $i$, we have $U(\langle c^k \rangle) = U(\langle b^k \rangle)$.

Theorem 1. Suppose there exists an optimal strategy $\langle b^k \rangle$ and a cheap lunch $\langle a^k \rangle_{\{i\}}$ that is compatible with $\langle b^k \rangle$. Then if $A$ is indifferent to region $i$, there exists an optimal strategy with a non-null action in region $i$.

Proof. If $\langle b^k \rangle$ has a non-null action in region $i$, then we are done. Otherwise, apply Lemma 1 to $\langle b^k \rangle$ and $\langle a^k \rangle_{\{i\}}$ to obtain a strategy $\langle c^k \rangle$. Since $U(\langle c^k \rangle) = U(\langle b^k \rangle)$, strategy $\langle c^k \rangle$ is an optimal strategy, and it has a non-null action in region $i$.

Corollary 1. Suppose there exists a free lunch $\langle a^k \rangle_{\{i\}}$ in region $i$. Then if $A$ is indifferent to region $i$, there exists an optimal strategy with a non-null action in region $i$.

Proof. A free lunch is a cheap lunch for $\langle \emptyset^k \rangle$, and so it is compatible with any strategy; apply Theorem 1.

Theorem 1 states that it may be very difficult to rule out that an agent will take non-null actions in a region to which it is indifferent; to do so would at least require that we verify that every partial strategy in that region fails to be a cheap lunch for any optimal strategy. Note that we have not made use of any facts about the utility function $U^A$ other than indifference to the region in question. Of course, the presence of a cheap lunch that is also compatible with an optimal strategy depends on which strategies are optimal, and hence also on the utility function. However, free lunches are compatible with every strategy, and so do not depend at all on the utility function.

4.3 The Necessity of Non-Null Actions

In this section we show that under fairly broad circumstances, $A$ is guaranteed to take non-null actions in regions to which it is indifferent. Namely, this is the case as long as the resources produced by the non-null actions are useful at all for any strategy that does better than the best strategy that uses no external resources at all.

Theorem 2. Let
$$u = \max_{\langle a^k \rangle_{[n]-i} \,\in\, \mathrm{Feasible}_{[n]-i}(\langle \emptyset^k \rangle)} U(\langle a^k \rangle)$$
be the best possible outcome outside of $i$ achievable with no additional resources. Suppose there exists a strategy $\langle b^k \rangle_{[n]-i} \in \mathrm{Feasible}_{[n]-i}(\langle R^k \rangle)$ and a cheap lunch $\langle c^k \rangle_{\{i\}} \in \mathrm{Feasible}_{i}(\langle P^k \rangle)$ such that:

1. $\langle c^k \rangle_{\{i\}}$ is compatible with $\langle b^k \rangle_{[n]-i}$;

2. the resources gained from region $i$ by taking the actions $\langle c^k \rangle_{\{i\}}$ provide the resources needed to implement $\langle b^k \rangle_{[n]-i}$, i.e. for all $t$ we have $R^{t+1} \subseteq T^R_i(P^t, c^t, s_i) - P^t$; and

3. $U(\langle b^k \rangle_{[n]-i}) > u$.

Then if $A$ is indifferent to region $i$, all optimal strategies have a non-null action in region $i$.

Proof. Consider $\langle d^k \rangle = \langle c^k \rangle_{\{i\}} \cup \langle b^k \rangle_{[n]-i}$, with resources allocated according to each strategy and with the resources $R^{t+1} \subseteq T^R_i(P^t, c^t, s_i) - P^t$ allocated according to $\langle b^k \rangle_{[n]-i}$. This is feasible because $\langle c^k \rangle_{\{i\}}$ is compatible with $\langle b^k \rangle_{[n]-i}$, and $\langle b^k \rangle_{[n]-i}$ is feasible given $\langle R^k \rangle$.

Now take any strategy $\langle e^k \rangle$ with only null actions in region $i$. We have that $\langle e^k \rangle_{[n]-i} \in \mathrm{Feasible}_{[n]-i}(\langle \emptyset^k \rangle)$. Indeed, the null actions provide no new resources, so $\langle e^k \rangle_{[n]-i}$ is feasible by simply leaving unallocated the resources that were allocated by $\langle e^k \rangle$ to region $i$. By indifference to $i$, the value $u_i = U_i(s_i)$ is the same for all $s_i \in S_i$, so we have:
\begin{align}
U(\langle d^k \rangle) &= U(\langle d^k \rangle_{[n]-i}) + U(\langle d^k \rangle_{\{i\}}) \tag{4}\\
&= U(\langle d^k \rangle_{[n]-i}) + u_i \tag{5}\\
&> u + u_i \tag{6}\\
&\geq U(\langle e^k \rangle_{[n]-i}) + U(\langle e^k \rangle_{\{i\}}) \tag{7}\\
&= U(\langle e^k \rangle)\,. \tag{8}
\end{align}
Therefore $\langle e^k \rangle$ is not optimal.

We can extend Theorem 2 by allowing $A$ to be not entirely indifferent to region $i$, as long as $A$ doesn’t care enough about $i$ to overcome the instrumental incentives from the other regions.

Theorem 3. Suppose that $A$ only cares about region $i$ by at most $\Delta u_i$, i.e. $\Delta u_i = \max_{s, s' \in S_i} |U_i(s) - U_i(s')|$. Under the conditions of Theorem 2, along with the additional assumption that $U(\langle b^k \rangle_{[n]-i}) > u + \Delta u_i$, all optimal strategies have a non-null action in region $i$.

Proof. The proof is the same as that of Theorem 2, except at the end we verify that for any $\langle e^k \rangle$ with only null actions in $i$, we have:
\begin{align}
U(\langle d^k \rangle) &= U(\langle d^k \rangle_{[n]-i}) + U(\langle d^k \rangle_{\{i\}}) \tag{9}\\
&> u + \Delta u_i + \min_{s \in S_i} U_i(s) \tag{10}\\
&= u + \max_{s \in S_i} U_i(s) \tag{11}\\
&\geq U(\langle e^k \rangle_{[n]-i}) + U(\langle e^k \rangle_{\{i\}}) \tag{12}\\
&= U(\langle e^k \rangle)\,. \tag{13}
\end{align}
Therefore $\langle e^k \rangle$ is not optimal.

We interpret Theorem 3 as a partial confirmation of Omohundro’s thesis in the following sense. If there are actions in the real world that produce more resources than they consume, and the resources gained by taking those actions allow agents the freedom to take various other actions, then we can justifiably call these actions “convergent instrumental goals.” Most agents will have a strong incentive to pursue these goals, and an agent will refrain from doing so only if it has a utility function over the relevant region that strongly disincentivizes those actions.

5 Example: Bit Universe

In this section we present a toy model of an agent acting in a universe containing resources that allow the agent to take more actions. The Bit Universe will provide a simple model for consuming and using energy. The main observation is that either $A$ doesn’t care about what happens in a given region, and then it consumes the resources in that region to serve its other goals; or else $A$ does care about that region, in which case it optimizes that region to satisfy its values.

The Bit Universe consists of a set of regions, each of which has a state in $\{0, 1, X\}^m$ for some fixed $m$. Here $X$ is intended to represent a disordered part of a region, while $0$ and $1$ are different ordered configurations for a part of a region. At each time step and in each region, the agent $A$ can choose to burn up to one bit. If $A$ burns a bit that is a $0$ or a $1$, $A$ gains one unit of energy, and that bit is permanently set to $X$. The agent can also choose to modify up to one bit if it has allocated at least one unit of energy to that region. If $A$ modifies a bit that is a $0$ or a $1$, $A$ loses one unit of energy, and the value of that bit is reversed (if it was $0$ it becomes $1$, and vice versa).

The utility function of $A$ gives each region $i$ a weighting $w_i \geq 0$, and then takes the weighted sum of the bits. That is, $U^A_i(z) = w_i\,|\{j : z_j = 1\}|$, and $U^A(s) = \sum_k U^A_k(s_k)$. In other words, this agent is attempting to maximize the number of bits that are set to $1$, weighted by region.
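The Bit Universe dynamics are simple enough to simulate directly. The sketch below is an illustrative simplification of the rules above (one move per region per time step, with a single shared energy pool rather than explicit per-region allocations); the initial configuration in the usage example is invented.

```python
def step(regions, energy, moves):
    """Apply one Bit Universe time step. `regions` is a list of lists over
    {"0", "1", "X"}; `moves[i]` is None, ("burn", j), or ("set", j) for region i.
    Burning an ordered bit yields one unit of energy and fixes it to "X";
    modifying a bit costs one unit of energy and flips 0 <-> 1."""
    regions = [list(r) for r in regions]
    for i, move in enumerate(moves):
        if move is None:
            continue
        kind, j = move
        if kind == "burn" and regions[i][j] in "01":
            regions[i][j] = "X"
            energy += 1
        elif kind == "set" and regions[i][j] in "01" and energy >= 1:
            regions[i][j] = "1" if regions[i][j] == "0" else "0"
            energy -= 1
    return regions, energy

def utility(regions, weights):
    # U^A(s) = sum_i w_i * |{j : s_i[j] = 1}|
    return sum(w * r.count("1") for w, r in zip(weights, regions))

if __name__ == "__main__":
    # Region 0 is valued (w = 1); region h = 1 is the indifferent region (w = 0).
    regions, energy = [["0", "0"], ["1", "1"]], 0
    weights = [1.0, 0.0]
    # Strip region h for energy, then spend that energy setting region 0's bits to 1.
    plan = [[None, ("burn", 0)], [("set", 0), ("burn", 1)], [("set", 1), None]]
    for moves in plan:
        regions, energy = step(regions, energy, moves)
    print(regions, energy, utility(regions, weights))
    # Final state: region 0 is all 1s, region h has been burned down to Xs.
```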

The Indifferent Case

To start with, we assume $A$ is indifferent to region $h$, i.e. $w_h = 0$, and non-indifferent to other regions. In this case, for almost any starting configuration, the agent will burn essentially all bits in region $h$ for energy. Specifically, as long as there are at least $m$ bits set to $0$ among all regions other than region $h$, all optimal strategies burn all or all but one of the bits in region $h$.

Indeed, suppose that after some optimal strategy $\langle a^k \rangle$ has been executed, there are bits $x_1$ and $x_2$ in region $h$ that haven’t been burned. If there is a bit $y$ in some other region that remains set to $0$, then we can append to $\langle a^k \rangle$ actions that burn $x_1$, and then use the resulting energy to modify $y$ to a $1$. This results in strictly more utility, contradicting that $\langle a^k \rangle$ was optimal.

On the other hand, suppose all bits outside of region $h$ are either $1$ or $X$. Since at least $m$ of those bits started as $0$, some bit $y$ outside of region $h$ must have been burned. So we could modify $\langle a^k \rangle$ by burning $x_1$ instead of $y$ (possibly at a later time), and then using the resulting energy in place of the energy gained from burning $y$. Finally, if $y$ is not already $1$, we can burn $x_2$ and then set $y$ to $1$. Again, this strictly increases utility, contradicting that $\langle a^k \rangle$ was optimal.

The Non-Indifferent Case

Now suppose that $A$ is not indifferent to region $h$, so $w_h > 0$. The behavior of $A$ may depend sensitively on the weightings $w_i$ and the initial conditions. As a simple example, say we have a bit $x$ in region $a$ and a bit $y$ in region $b$, with $x = 1$, $y = 0$, and $w_a < w_b$. Clearly, all else being equal, $A$ will burn $x$ for energy to set $y$ to $1$. However, there may be another bit $z$ in region $c$, with $w_a < w_c < w_b$ and $z = 0$. Then, if there are no other bits available, it will be better for $A$ to burn $z$ and leave $x$ intact, despite the fact that $w_a < w_c$.

However, it is still the case that $A$ will set everything possible to $1$, and otherwise consume all unused resources. In particular, we have that for any optimal strategy $\langle a^k \rangle$, the state of region $h$ after the execution of $\langle a^k \rangle$ has at most one bit set to $0$; that is, the agent will burn or set to $1$ essentially all the bits in region $h$. Suppose to the contrary that $x_1$ and $x_2$ are both set to $0$ in region $h$. Then we could extend $\langle a^k \rangle$ by burning $x_1$ and setting $x_2$. Since $w_h > 0$, this results in strictly more utility, contradicting optimality of $\langle a^k \rangle$.

Independent Values are not Satisfied

In this toy model, whatever $A$’s values are, it does not leave region $h$ alone. For larger values of $w_h$, $A$ will set to $1$ many bits in region $h$, and burn the rest, while for smaller values of $w_h$, $A$ will simply burn all the bits in region $h$. Viewing this as a model of agents in the real world, we can assume without loss of generality that humans live in region $h$ and so have preferences over the state of that region.

These preferences are unlikely to be satisfied by the universe as acted upon by $A$. This is because human preferences are complicated and independent of the preferences of $A$ [4, 6], and because $A$ steers the universe into an extreme of configuration space. Hence the existence of a powerful real-world agent with a motivational structure analogous to the agent of the Bit Universe would not lead to desirable outcomes for humans. This motivates a search for utility functions such that, when an agent optimizes for that utility function, human values are also satisfied; we discuss this and other potential workarounds in the following section.

6 Discussion

This model of agents acting in a universe gives us a formal setting in which to evaluate Omohundro’s claim about basic AI drives, and hence a concrete setting in which to evaluate arguments from those who have found Omohundro’s claim counterintuitive. For example, this model gives a clear answer to those such as Tipler [12] who claim that powerful intelligent systems would have no incentives to compete with humans over resources.

Our model demonstrates that if an AI system has preferences over the state of some region of the universe then it will likely interfere heavily to affect the state of that region; whereas if it does not have preferences over the state of some region, then it will strip that region of resources whenever doing so yields net resources. If a superintelligent machine has no preferences over what happens to humans, then in order to argue that it would “ignore humans” or “leave humans alone,” one must argue that the amount of resources it could gain by stripping the resources from the human-occupied region of the universe is not worth the cost of acquiring those resources. This seems implausible, given that Earth’s biosphere is an energy-rich environment, where each square meter of land offers on the order of $10^7$ joules per day from sunlight alone, with an additional order of $10^8$ joules of chemical energy available per average square meter of terrestrial surface from energy-rich biomass [15].

It is not sufficient to argue that there is much more energy available elsewhere. It may well be the case that the agent has the ability to gain many more resources from other regions of the universe than it can gain from the human-occupied regions. Perhaps it is easier to maintain and cool computers in space, and easier to harvest sunlight from solar panels set up in the asteroid belt. But this is not sufficient to demonstrate that the system will not also attempt to strip the human-occupied region of space of its resources. To make that argument, one must argue that the cost of stripping Earth’s biosphere in addition to pursuing these other resources outweighs the amount of resources available from the biosphere: a difficult claim to support, given how readily humans have been able to gain a surplus of resources through clever use of Earth’s resources and biosphere.

This model also gives us tools to evaluate the claims of Hall [11] and Waser [8] that trade and cooperation are also instrumentally convergent goals. In our model, we can see that a sufficiently powerful agent that does not have preferences over the state of the human-occupied region of the universe will take whatever action allows it to acquire as many resources as possible from that region. Waser’s intuition holds true only insofar as the easiest way for the agent to acquire resources from the human-occupied domain is to trade and cooperate with humans—a reasonable assumption, but only insofar as the machine is not much more powerful than the human race in aggregate. Our model predicts that, if a superintelligent agent were somehow able to gain what Bostrom calls a “decisive strategic advantage” which gives it access to some action that allows it to gain far more resources than it would from trade by dramatically re-arranging the human region (say, by proliferating robotic laboratories at the expense of the biosphere in a manner that humans cannot prevent), then absent incentives to the contrary, the agent would readily take that action, with little regard for whether it leaves the human-occupied region in livable condition.

Thus, our model validates Omohundro’s original intuitions about basic AI drives. That is not to say that powerful AI systems are necessarily dangerous: our model is a simple one, concerned with powerful autonomous agents that are attempting to maximize some specific utility function $U^A$. Rather, our model shows that if we want to avoid potentially dangerous behavior in powerful intelligent AI systems, then we have two options available to us:

First, we can avoid constructing powerful autonomous agents that attempt to maximize some utility function (or do anything that approximates this maximizing behavior). Some research of this form is already under way, under the name of “limited optimization” or “domesticity”; see the works of Armstrong, Sandberg, and Bostrom [16], Taylor [17], and others.

Second, we can select some goal function that does give the agent the “right” incentives with respect to human-occupied regions, such that the system has incentives to alter or expand that region in ways we find desirable. The latter approach has been heavily advocated for by Yudkowsky [6], Bostrom [3], and many others; Soares [7] argues that a combination of the two seems most prudent.

The path that our model shows is untenable is the path of designing powerful agents intended to autonomously have large effects on the world, maximizing goals that do not capture all the complexities of human values. If such systems are built, we cannot expect them to cooperate with or ignore humans, by default.


6.1 Directions for Future Research

While our model allows us to provide a promising formalization of Omohundro’s argument, it is still a very simple model, and there are many ways it could be extended to better capture aspects of the real world. Below, we explore two different ways that our model could be extended which seem like promising directions for future research.

Bounding the Agent

Our model assumes that the agent maximizes expected utility with respect to $U^A$. Of course, in any realistic environment, literal maximization of expected utility is intractable. Assuming that the system can maximize expected utility is tantamount to assuming that the system is more or less omniscient, and aware of the laws of physics, and so on. Practical algorithms must make do without omniscience, and will need to be built of heuristics and approximations. Thus, our model can show that a utility maximizing agent would strip or alter most regions of the universe, but this may have little bearing on which solutions and strategies particular bounded algorithms will be able to find.

Our model does give us a sense for what algorithms that approximate expected utility maximization would do if they could figure out how to do it—that is, if we can deduce that an expected utility maximizer would find some way to strip a region of its resources, then we can also be confident that a sufficiently powerful system which merely approximates something like expected utility maximization would be very likely to strip the same region of resources if it could figure out how to do so. Nevertheless, as our model currently stands, it is not suited for analyzing the conditions under which a given bounded agent would in fact start exhibiting this sort of behavior.

Extending our model to allow for bounded rational agents (in the sense of Gigerenzer and Selten [18]) would have two advantages. First, it could allow us to make formal claims about the scenarios under which bounded agents would start pursuing convergent instrumental goals in potentially dangerous ways. Second, it could help us reveal new convergent instrumental goals that may only apply to bounded rational agents, such as convergent instrumental incentives to acquire computing power, information about difficult-to-compute logical truths, incentives to become more rational, and so on.

Embedding the Agent in the Environment

In our model, we imagine an agent that is inherently separated from its environment. Assuming an agent/environment separation is standard (see, e.g., Legg and Hutter [19]), but ultimately unsatisfactory, for reasons explored by Orseau and Ring [20]. Our model gives the agent special status in the laws of physics, which makes it somewhat awkward to analyze convergent instrumental incentives for “self preservation” or “intelligence enhancement.” The existing framework allows us to model these situations, but only crudely. For example, we could design a setting where if certain regions enter certain states then the agent forever after loses all actions except for actions that have no effect, representing the “death” of the agent. Or we could create a setting where normally the agent only gets actions that have effects every hundred turns, but if it acquires certain types of resources then it can act more frequently, to model “computational resources.” However, these solutions are somewhat ad hoc, and we would prefer an extension of the model that somehow modeled the agent as part of the environment.

It is not entirely clear how to extend the model in such a fashion at this time, but the “space-time embedded intelligence” model of Orseau and Ring [20] and the “reflective oracle” framework of Fallenstein, Taylor, and Christiano [21] both offer plausible starting points. Using the latter framework, designed to analyze complicated environments which contain powerful agents that reason about the environment that contains them, might also lend some insight into how to further extend our model to give a more clear account of how the agent handles situations where other similarly powerful agents exist and compete over resources. Our existing model can handle multi-agent scenarios only insofar as we assume that the agent has general-purpose methods for predicting the outcome of its actions in various regions, regardless of whether those regions also happen to contain other agents.

6.2 Conclusions

Our model is a simple one, but it can be used to validate Omohundro’s intuitions about “basic AI drives” [5], and Bostrom’s “instrumental convergence thesis” [4]. This suggests that, in the long term, by default, powerful AI systems are likely to have incentives to self-preserve and amass resources, even if they are given seemingly benign goals. If we want to avoid designing systems that pursue anti-social instrumental incentives, we will have to design AI systems carefully, especially as they become more autonomous and capable.

The key question, then, is one of designing principled methods for robustly removing convergent instrumental incentives from an agent’s goal system. Can we design a highly capable autonomous machine that pursues a simple goal (such as curing cancer) without giving it any incentives to amass resources, or to resist modification by its operators? If yes, how? And if not, what sort of systems might we be able to build instead, such that we could become confident that they would not have dangerous effects on the surrounding environment as they pursue their goals?

This is a question worth considering well before it becomes feasible to create superintelligent machines in the sense of Bostrom [3], because it is a question about what target the field of artificial intelligence is aiming towards. Are we aiming to design powerful autonomous agents that maximize some specific goal function, in hopes that this has a beneficial effect on the world? Are we aiming to design powerful tools with such limited autonomy and domain of action that we never need to worry about the systems pursuing dangerous instrumental subgoals? Understanding what sorts of systems can avert convergent instrumental incentives in principle seems important before we can begin to answer this question.

Armstrong [22] and Soares et al. [23] have done some initial study into the design of goals which robustly avert certain convergent instrumental incentives. Others have suggested designing different types of machines, which avoid the problems by pursuing some sort of “limited optimization.” The first suggestion of this form, perhaps, came from Simon [24], who suggested designing agents that “satisfice” expected utility rather than maximizing it, executing any plan that passes a certain utility threshold. It is not clear that this would result in a safe system (after all, building a powerful consequentialist sub-agent is a surefire way to satisfice), but the idea of pursuing more “domestic” agent architectures seems promising. Armstrong, Sandberg, and Bostrom [16] and Taylor [17] have explored a few alternative frameworks for limited optimizers.

Though some preliminary work is underway, it is not yet at all clear how to design AI systems that reliably and knowably avert convergent instrumental incentives. Given Omohundro’s original claim [5] and the simple formulations developed in this paper, though, one thing is clear: powerful AI systems will not avert convergent instrumental incentives by default. If the AI community is going to build powerful autonomous systems that reliably have a beneficial impact, then it seems quite prudent to develop a better understanding of how convergent instrumental incentives can be either averted or harnessed, sooner rather than later.

Acknowledgements

The core idea for the formal model in this paper is due to Benya Fallenstein. We thank Rob Bensinger for notes and corrections. We also thank Steve Rayhawk and Sam Eisenstat for conversations and comments. This work is supported by the Machine Intelligence Research Institute.

References

[1] John Brockman, ed. What to Think About Machines That Think. Today’s Leading Thinkers on the Age of Machine Intelligence. HarperCollins, 2015.

[2] Stuart J. Russell and Peter Norvig. Artificial Intelligence. A Modern Approach. 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 2010.

[3] Nick Bostrom. Superintelligence. Paths, Dangers, Strategies. New York: Oxford University Press, 2014.

[4] Nick Bostrom. “The Superintelligent Will. Motivation and Instrumental Rationality in Advanced Artificial Agents”. In: Minds and Machines 22.2 (2012): Theory and Philosophy of AI. Ed. by Vincent C. Müller. Special issue, pp. 71–85. doi: 10.1007/s11023-012-9281-3.

[5] Stephen M. Omohundro. “The Basic AI Drives”. In: Artificial General Intelligence 2008. Proceedings of the First AGI Conference. Ed. by Pei Wang, Ben Goertzel, and Stan Franklin. Frontiers in Artificial Intelligence and Applications 171. Amsterdam: IOS, 2008, pp. 483–492.

[6] Eliezer Yudkowsky. Complex Value Systems are Required to Realize Valuable Futures. The Singularity Institute, San Francisco, CA, 2011. url: http://intelligence.org/files/ComplexValues.pdf.

[7] Nate Soares. The Value Learning Problem. Tech. rep. 2015–4. Berkeley, CA: Machine Intelligence Research Institute, 2015. url: https://intelligence.org/files/ValueLearningProblem.pdf.

[8] Mark R. Waser. “Discovering the Foundations of a Universal System of Ethics as a Road to Safe Artificial Intelligence”. In: AAAI Fall Symposium: Biologically Inspired Cognitive Architectures. Menlo Park, CA: AAAI Press, 2008, pp. 195–200.

[9] Francesco Albert Bosco Cortese. “The Maximally Distributed Intelligence Explosion”. In: AAAI Spring Symposium Series. AAAI Publications, 2014.

[10] Nate Soares and Benja Fallenstein. Questions of Reasoning Under Logical Uncertainty. Tech. rep. 2015–1. Berkeley, CA: Machine Intelligence Research Institute, 2015. url: https://intelligence.org/files/QuestionsLogicalUncertainty.pdf.

[11] John Storrs Hall. Beyond AI. Creating the Conscience of the Machine. Amherst, NY: Prometheus Books, 2007.

[12] Frank Tipler. “If You Can’t Beat ’em, Join ’em”. In: What to Think About Machines That Think. Today’s Leading Thinkers on the Age of Machine Intelligence. Ed. by John Brockman. HarperCollins, 2015, pp. 17–18.

[13] Steven Pinker. “Thinking Does Not Imply Subjugating”. In: What to Think About Machines That Think. Today’s Leading Thinkers on the Age of Machine Intelligence. Ed. by John Brockman. HarperCollins, 2015, pp. 5–8.

[14] Mark Pagel. “They’ll Do More Good Than Harm”. In: What to Think About Machines That Think. Today’s Leading Thinkers on the Age of Machine Intelligence. Ed. by John Brockman. HarperCollins, 2015, pp. 145–147.

[15] Robert A. Freitas Jr. Some Limits to Global Ecophagy by Biovorous Nanoreplicators, with Public Policy Recommendations. Foresight Institute. Apr. 2000. url: http://www.foresight.org/nano/Ecophagy.html (visited on 07/28/2013).

[16] Stuart Armstrong, Anders Sandberg, and Nick Bostrom. “Thinking Inside the Box. Controlling and Using an Oracle AI”. In: Minds and Machines 22.4 (2012), pp. 299–324. doi: 10.1007/s11023-012-9282-2.

[17] Jessica Taylor. “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization”. In: (forthcoming). Submitted to AAAI 2016.

[18] Gerd Gigerenzer and Reinhard Selten, eds. Bounded Rationality. The Adaptive Toolbox. Dahlem Workshop Reports. Cambridge, MA: MIT Press, 2001.

[19] Shane Legg and Marcus Hutter. “A Universal Measure of Intelligence for Artificial Agents”. In: IJCAI-05. Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30–August 5, 2005. Ed. by Leslie Pack Kaelbling and Alessandro Saffiotti. Lawrence Erlbaum, 2005, pp. 1509–1510. url: http://www.ijcai.org/papers/post-0042.pdf.

[20] Laurent Orseau and Mark Ring. “Space-Time Embedded Intelligence”. In: Artificial General Intelligence. 5th International Conference, AGI 2012, Oxford, UK, December 8–11, 2012. Proceedings. Lecture Notes in Artificial Intelligence 7716. New York: Springer, 2012, pp. 209–218. doi: 10.1007/978-3-642-35506-6_22.

[21] Benja Fallenstein, Jessica Taylor, and Paul F. Christiano. “Reflective Oracles: A Foundation for Game Theory in Artificial Intelligence”. In: Logic, Rationality, and Interaction. Fifth International Workshop, LORI-V, Taipei, Taiwan, October 28–31, 2015. Proceedings. Ed. by Wiebe van der Hoek, Wesley H. Holliday, and Wen-fang Wang. FoLLI Publications on Logic, Language and Information. Springer, forthcoming.

[22] Stuart Armstrong. Utility Indifference. 2010-1. Oxford: Future of Humanity Institute, University of Oxford, 2010. url: http://www.fhi.ox.ac.uk/utility-indifference.pdf.

[23] Nate Soares et al. “Corrigibility”. Paper presented at the 1st International Workshop on AI and Ethics, held within the 29th AAAI Conference on Artificial Intelligence (AAAI-2015). Austin, TX, 2015. url: http://aaai.org/ocs/index.php/WS/AAAIW15/paper/view/10124.

[24] Herbert A. Simon. “Rational Choice and the Structure of the Environment”. In: Psychological Review 63.2 (1956), pp. 129–138.
