Abstractions for devising
compact controllers for
MDPs
Research Thesis
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
Kolman Vornovitsky
Submitted to the Senate of
the Technion — Israel Institute of Technology
Av 5770 Haifa July 2011
The research thesis was carried out under the supervision of Associate Professor Carmel Domshlak, Faculty of Industrial Engineering and Management, Technion.
The generous financial support of the Technion is gratefully acknowledged.
Contents
Abstract
Abbreviations and Notations
1 Introduction
2 Background
2.1 Value iteration algorithm
2.2 (Modified) Policy iteration algorithm
2.3 Planning for Factored MDPs
2.4 Deterministic planning
3 Merge-and-Shrink Compression of MDPs
3.1 Stochastic transition graphs
3.2 Abstractions
3.3 Generic abstraction algorithm
3.3.1 Using abstraction as a controller for MDP
3.4 Merge and shrink strategies
3.4.1 Shrink strategies
3.4.2 Merge strategies
3.5 Edge abstraction
4 Experiments
4.1 Domains
4.1.1 Blocksworld
4.1.2 Exploding blocksworld
4.1.3 Boxworld
4.1.4 Triangle tireworld
4.1.5 Rectangle tireworld
4.1.6 Schedule
4.1.7 Search and rescue
4.1.8 Sysadmin
4.2 Setup and environment
4.3 Results
5 Summary and Future work
Bibliography
Abstract in Hebrew
List of Figures
2.1 All instantiations of action AC in the tireworld example. The labels on the transitions are the effect (partial variable assignment), its probability, and its reward, in that order.
2.2 Simulation of the value iteration algorithm on the running Tireworld example. V0 is a set of arbitrary initial values (not necessarily 0). Vi and πi are the iterating value and policy. V5 = V6 means the algorithm has converged and that V6 = V ∗.
3.1 Transition graph of the tireworld problem. State 〈A, 1, 0〉 means location = A, has spare = 1 and flat tire = 0. 〈A, 0, 0〉 is the initial state of the problem. An edge label AC1, 1, 0.6 refers to effect 1 of action AC with reward 1 and probability 0.6.
3.2 Abstraction of transition graph from left to right where s(s1) = s(s2) = s(s3) and s(t1) = s(t2). Note that action a becomes two different actions as1 and as3. Rewards and probabilities are preserved.
3.3 Atomic projection of the tireworld problem on variable location (πlocation) on the left side and variable flat tire (πflat tire) on the right side.
3.4 Synchronized product of two atomic projections of the tireworld problem πlocation ⊗ πflat tire.
3.5 Abstraction of TA = πlocation ⊗ πflat tire, first case.
3.6 Abstraction of TA = πlocation ⊗ πflat tire, second case.
3.7 Final abstraction of full tireworld problem, first case.
3.8 Final abstraction of full tireworld problem, second case.
3.9 Example of dominated actions. B = {b, c} dominates a.
3.10 Merge-and-shrink algorithm extended with dominated actions elimination.
4.1 Results for the blocksworld domain. *Average was taken on 4/8 planners.
4.2 Results for the exploding blocksworld domain. *Average was taken on 5/8 planners.
4.3 Results for the boxworld domain. *Average was taken on 2/8 planners.
4.4 Results for the joint Tireworld domains as presented and tested in the competition. Instances 1 to 10 are Triangular and 11 to 15 are Rectangular. *Average was taken on 4/8 planners.
4.5 Results for the schedule domain. *Average was taken on 4/8 planners.
4.6 Results for the search and rescue domain. *Average was taken on 2/8 planners.
Abstract
The ability to plan a course of action is crucial for intelligent systems. It
involves the representation of actions and world models, reasoning about
the effects of actions, and techniques for efficiently searching the space of
possible plans. Planning under uncertainty is captured by decision-theoretic
planning (DTP), where the actions have stochastic effects, and the goal is to
devise a policy of acting with a high expected utility. This is in contrast to
deterministic planning, where a sequence of actions that translates an agent
from an initial state to a goal state is sought. The classic mathematical
framework for decision-theoretic planning is that of the Markov decision
process (MDP).
Research on AI planning, reasoning under uncertainty, and decision anal-
ysis and operations research has given rise to the interesting insight that real-world
DTP problems typically exhibit considerable structure. One of the most
popular structures is exploited by the use of variable-based representations
to describe problems, as is common practice in planning. Such variable
based representations allow for compact description of the huge state space,
but cast doubt upon the viability of most standard solutions for MDPs, i.e.,
algorithms which compute the optimal policy assuming an explicit state
space.
Over the last two decades, some works have presented solutions for MDPs
with variable based representation. Factored MDPs (Boutilier et al. [10])
use implicit variable based state space and a dynamic Bayesian network
(DBN) as a compact representation of the transition model. Most works
have approximated the solution using an approximate value function with
compact representation given by a linear combination of possibly non-linear
basis functions (Bellman et al. 1963 [3]; Sutton, 1988 [31]; Tsitsiklis et al.
1997 [35]). Guestrin et al. 2003 [16] proposed a similar solution, built on
the idea of Koller & Parr (1999, 2000) [23, 24] and using factored (linear)
functions, where each basis function is restricted to some small subset of
the state variables. Dean & Givan 1997 [13] offered a somewhat different
approach in which the model is compressed much the same way that a finite
state machine is reduced to its equivalent minimal finite state machine.
While previous works on factored MDPs approximate the value function
of the original MDP, in this work we explore a different approach of calcu-
lating the exact value function of an approximated MDP. We exploit and
extend a technique known as over-approximating abstractions to approxi-
mately solve the exponential state space MDP problem.
An abstraction can, in general, be seen as a mapping that reduces the
size of the state space by compacting several states into one. If the ab-
stract state space is made small enough, the standard solutions for explicit
state space become feasible for it as well. Over-approximating abstractions,
adapted to the context of deterministic planning by Helmert, Haslum &
Hoffmann [17], are based on the merge-and-shrink methodology introduced
by Drager, Finkbeiner & Podelski [20]. We depart from these methodologies,
adapting the merge-and-shrink abstraction technique to devise compact con-
trollers for MDPs. We introduce the notion of action abstractions to extend
the merge-and-shrink abstraction technique for MDPs and for deterministic
planning. Finally, we provide a clear testbed evaluation for our methods
and compare them to other state-of-the-art approaches.
Abbreviations and Notations
V = {v1, ..., vn} — Set of variables
Dv — Finite domain of variable v
S — Set of all states
A — Set of actions, where a ∈ A is a pair 〈pre, E〉
pre — Partial assignment over V (an action's precondition)
E — Effects of an action; a set of partial assignments over V
fs,a : E → S — Transition function; fs,a(e) = s′ calculates the destination state s′ of effect e of action a at state s
Pa — Probability of effect e, denoted by Pa(e)
Ra — Reward of effect e, denoted by Ra(e)
Π = 〈V,A, s0,P,R〉 — PSAS+ MDP problem
P — Probability distribution for all actions; P(e) = Pa(e) where e ∈ E and a = 〈pre, E〉 ∈ A
R — Reward function for all actions; R(e) = Ra(e) where e ∈ E and a = 〈pre, E〉 ∈ A
π : S → A — Policy for a PSAS+ MDP problem; π(s) = a is the action the agent should choose when at state s
V π(s) — Expected value at state s using policy π
V ∗(s) — Optimal expected value at state s
T = 〈S,L,A, s0, S∗〉 — Transition graph
T = 〈T,R, P 〉 — Stochastic transition graph with rewards
T (Π) — Stochastic transition graph with rewards for a PSAS+ MDP problem
α : S → S′ — Abstraction function
TA — Abstract stochastic transition graph with rewards
N — Limit on number of states
M — Limit on number of edges
Chapter 1
Introduction
The ability to plan a course of action is crucial for intelligent systems, in-
creasing their autonomy and flexibility through the construction of sequences
of actions to achieve their goals. Planning has been studied in the context
of artificial intelligence for over three decades [27]. Planning techniques
have been applied to a variety of tasks, including robotics, process plan-
ning, Web-based information gathering, autonomous agents, and spacecraft
mission control. Planning involves representation of actions and world mod-
els, reasoning about actions’ effects, and devising techniques for efficiently
searching the space of possible plans [1, 36].
Deterministic planning focuses on translating the agent from one state to
another while assuming only one outcome for each action [1, 36]. A solution
for deterministic planning is a sequence of applicable actions from the initial
state to a goal state. An optimal solution minimizes the number of actions
needed to achieve the goal. A more general formalization of planning, which
includes planning under uncertainty, is decision-theoretic planning (DTP)
[7]. The aim of DTP is to form courses of action (plans or policies) that
have high expected utility rather than plans that are guaranteed to achieve
certain goals with a minimal number of actions. Most sequential decision
problems can be captured semantically by the Markov decision process
(MDP) model [2, 19, 4, 25].
The classic approaches to solving MDPs are well-known dynamic pro-
gramming algorithms such as value iteration [2] and policy iteration [19].
Those algorithms compute the optimal decision an agent has to make for
every state to obtain the highest expected value possible. Many other robust
methods for optimal policy construction have been developed in the opera-
tions research (OR) community, including modified policy iteration [26] and
asynchronous versions of the well-known value and policy iteration algo-
rithms [4, 6]. Common to all these methods is that they require explicit
enumeration of the underlying state space of the MDP.
Using MDPs as a model for solving planning problems has illuminated
a number of interesting connections between techniques for solving decision
problems. Those techniques came from AI planning, reasoning under un-
certainty, decision analysis and OR. One of the most interesting insights
emerging from this body of work is that real-world DTP problems typically
exhibit considerable structure, and thus can be solved using special-purpose
methods that recognize and exploit that structure. Variable based problem
representation is one of the most popular of these structures, and its use
is common practice in planning. While variable based representation high-
lights the problem’s special structure and allows it to be exploited computa-
tionally, it casts doubt on the viability of standard solutions for MDPs. The
standard solutions usually assume explicit state space, whereas for variable-
based problem representations the state space becomes exponential in the
number of variables.
Over the last two decades, some works have presented solutions for MDPs
with variable based representation. Factored MDPs (Boutilier et al. [10])
use implicit, variable-based state space and a dynamic Bayesian network
(DBN) that allow a compact representation of the transition model. Some
works have approximated the solution by approximating the value function
using a linear combination of potentially non-linear basis functions. This
technique was used by Bellman et al. 1963 [3]; Sutton, 1988 [31]; Tsitsiklis
et al. 1997 [35]. Guestrin et al. 2003 [16] used basis functions, with each
such function taking a small subset of the state variables as parameters.
Dean & Givan 1997 [13] proposed a somewhat different approach that is
based on minimizing the model in the same way that a finite state machine
is reduced to its equivalent minimal finite state machine. Their algorithm
takes as input an implicit MDP model in factored form and tries to produce
an explicit, reduced model whose size is within a polynomial factor of the
size of the factored representation. The algorithm cannot guarantee that
the model’s size will always be reduced as required.
This work proposes a different approach for handling the exponential
state space. Instead of approximating the value function on the original,
large MDP as previous works do, we propose calculating the exact value
function of an approximated MDP, which we construct using all the state
variables of the large original MDP. The state space of the resulting ap-
proximated MDP is explicitly upper-bounded by a predefined parameter, in
order to enable the standard MDP algorithms to solve it optimally.
This idea was inspired by some recent advances in heuristic search for
deterministic planning. The standard, generally viable approach to solv-
ing deterministic planning problems is search in one form or another, with
heuristics being the most important general method for improving search ef-
ficiency. Heuristics are functions that approximately estimate the distance
to a goal state in the search space. These functions help in navigating
the search process. One method for devising a good heuristic is over-
approximating abstractions [21, 12, 14]. Very roughly, an abstraction is a
mapping that reduces the size of the state space by contracting several states
into one. By making the abstract space small enough, it becomes feasible to
perform various reachability-analysis tasks on it by using explicit methods
such as breadth-first search or Dijkstra’s algorithm. Those analyses on the
abstractions allow building viable heuristics for the full exponential state
space.
While abstractions in deterministic planning are used mainly (if not ex-
clusively) for deriving informative admissible estimates of the distances from
a state to the goal, our goal is to use abstractions to “compress” the state
space while preserving as much as possible those original problem proper-
ties that have the greatest influence on the solution. The compression is
parametrized by the resource limitations. At the extremes, this abstraction
will correspond to abstracting all the states to one state, and to abstracting
nothing at all. The interesting cases are, of course, in the middle, when
we do have some realistically small but still non-negligible memory, and the
task is to use it the best way possible.
On the technical side, we generalize the merge-and-shrink abstraction
technique introduced by Drager, Finkbeiner & Podelski [20] in the con-
text of verification of systems of concurrent automata, and further extended
and adapted to the context of deterministic planning by Helmert, Haslum
& Hoffmann [17]. The computational feasibility of this approach rests on
interleaving the composition of various system properties (a.k.a. state vari-
ables) with abstraction of the intermediate composites. As Helmert et al.
[17] show, it allows very accurate heuristics to be obtained from relatively
compact abstractions. The greater flexibility offered by not restricting ab-
stractions solely to projections on system properties is, however, a mixed
blessing. The already-hard problem of selecting a good abstraction from the vast number of possible ones becomes even harder.
The contributions of our work are as follows. First, we provide rigorous
semantics for the merge and shrink operators on structured MDPs. Second,
we suggest effective and semantically justifiable strategies for both state
contraction (shrink) and state-space refinement (merge). We analyze the
relative attractiveness of the proposed strategies on different problems.
Sometimes abstracting the state space is not enough. In deterministic
planning, the number of possible effects is S · A, where S is the number of
states and A is the number of different actions. In DTP, each action may
have a number of outcomes, and thus the number of effects is even larger.
In addition to adopting state abstraction, we have extended the merge-and-
shrink technique both for MDPs and for deterministic planning with action
abstraction techniques. Those techniques allow us to cope more efficiently
with resource limitations by merging or even removing some effects from the
abstract model, sometimes without loss of any viable information.
Finally, we provide a clear testbed evaluation for our methods and com-
pare them to other state-of-the-art approaches. This empirical study of the
effectiveness of (approximately) solving structured MDPs is the main focus
of our work. In order to evaluate our algorithm, we used DTP tasks from
the fully observable probabilistic track of the 2008 international planning
competition [11]. Six domains were tested against eight algorithms from the
same competition. All the domains are MDPs, but some exhibit planning-
like goals and structure. In three of them, our approach exhibited better
performance than other state-of-the-art algorithms. Two planning-like do-
mains were very problematic for our approach. The other domains had
fair results. Overall, our approach appears comparable to state-of-the-art
algorithms.
Chapter 2
Background
A planning problem is usually given by a description of states and condi-
tioned transitions of some system. An initial state and set of goal states
are usually given. Each state has its own feasible set of actions. An action
can be applied only if it is in the feasible set of the system’s current state.
Those actions translate the system from one state to another. A solution to
a planning problem is a sequence of actions that translates the system from
the initial state into one of the goal states.
This work focuses on decision-theoretic planning (DTP). The goal of
DTP is to form courses of action (plans or policies) in stochastic environ-
ments that have high expected utility rather than plans that translate the
system from its initial state to the goal state. Most sequential decision prob-
lems with full observability can be viewed as instances of Markov decision
processes (MDPs).
Definition 1 State space
1. V = {v1, ..., vn} is a set of state variables. Dv is a finite domain of
a variable v.
2. A partial variable assignment over V is a function s on a subset
of V such that s(v) ∈ Dv wherever s(v) is defined.
3. If s(v) is defined for all v ∈ V, s is called a state or full variable
assignment. The set of all states is denoted by S.
4. We say that a partial variable assignment p1 agrees with another par-
tial variable assignment p2 iff for every variable v such that p1(v) and
p2(v) are defined, p1(v) = p2(v).
An action is instigated by an agent in order to change the system’s state.
We assume that the agent has control over what actions are taken and when,
though the effects of taking an action might not be perfectly predictable.
We also assume that not all actions can be applied to every state.
Definition 2 Actions and transition function
A is a set of actions, where an action a is a tuple 〈pre, E,P〉
1. pre is a partial variable assignment called a precondition.
2. E is a set of partial variable assignments called effects.
3. P : E → [0, 1] is a probability distribution over E.
4. a is said to be applicable in state s if pre agrees with s. By As we
denote the set of actions applicable in state s.
Let there be a state s such that an action a = 〈pre, E,P〉 is applicable at s. Then fs,a : E → S is a transition function such that fs,a(e) = s′, where s′ agrees with e, and for every variable v on which e is not defined, s′(v) = s(v).
An agent can choose an applicable action at the current state, (possibly)
changing it to some other state. As mentioned, the likelihood of this change
is given by a stochastic transition function. This function uses the Markov
assumption, which says that knowledge of the present state renders infor-
mation about past states and agent choices irrelevant. Thus, the stochastic
transition function only depends on the state and the action chosen by the
agent and not on the previous states or choices.
In theory, the number of effects for each action is limited by |S| but in
practice we assume that the number of effects per action is limited by some
constant. Thus we define the stochastic transition function over the effects
of an action and not the destination states.
Rewards are used to evaluate agent performance. Generally the rewards
are defined over the entire state space but we will assume rewards which
are state independent. The reward function R associates a reward with the
outcome of performing an action a (which is one of a’s effects).
Definition 3 PSAS+ Markov decision process.
A PSAS+ Markov decision process or PSAS+ MDP for short is a
tuple Π = 〈V,A, s0,R〉 with the following components:
1. V is a set of state variables.
2. A is a set of actions
3. s0 ∈ S is an initial state.
4. R is a real-valued function from the union of all actions’ effects.
Example 1 Let us consider the following problem (based on the IPC 2008
tireworld domain, from the planning-under-uncertainty category [11]). You
have a car and you need to get from location A to C, which can be done
directly or through another place called B. Each time you travel you have a
chance of getting a flat tire. You may also carry a spare tire, which you
can use to replace your flat tire, and you can pick up spare tires at place B.
You have to get to C without any flat tires. Getting to C from A will give
you 1 reward point, and getting to C from B will give you 1.5 points. You
cannot travel if you have a flat tire.
The appropriate PSAS+ MDP Π = 〈V,A, s0,R〉 is as follows:
1. V = {location, has spare, flat tire} where the last two are binary
variables and the location is one of {A,B,C}. We will use the (ℓ, s, f)
notation, where f, s ∈ {0, 1, ?} and ℓ ∈ {A,B,C, ?}, to specify states
and partial variable assignments of the state space. For example, the
state (C, 0, 1) means you are at location C, you have no spare, and
you do have a flat tire. (C, ?, ?) is a partial assignment meaning that
the variable location is set to C and the others are not defined.
2. A = {AB,AC,BC,LT,CT} is the set of actions. AB, AC and BC
denote three travel actions, LT denotes the action of loading the tire
into the car, and CT denotes the action of changing a flat tire. Each
action a is a tuple 〈pre, E, P〉:
AB = 〈(A, ?, 0), {(B, ?, 1), (B, ?, 0)}, {P(B, ?, 1) = 0.4, P(B, ?, 0) = 0.6}〉
AC = 〈(A, ?, 0), {(C, ?, 1), (C, ?, 0)}, {P(C, ?, 1) = 0.4, P(C, ?, 0) = 0.6}〉
BC = 〈(B, ?, 0), {(C, ?, 1), (C, ?, 0)}, {P(C, ?, 1) = 0.4, P(C, ?, 0) = 0.6}〉
LT = 〈(B, 0, ?), {(B, 1, ?)}, {P(B, 1, ?) = 1}〉
CT = 〈(?, 1, 1), {(?, 0, 0)}, {P(?, 0, 0) = 1}〉.
3. s0 = 〈A, 0, 0〉.
4. RAB(B, ?, 1) = 0 and RAB(B, ?, 0) = 0
RAC(C, ?, 1) = 0 and RAC(C, ?, 0) = 1
RBC(C, ?, 1) = 0 and RBC(C, ?, 0) = 1.5
RLT (B, 1, ?) = 0
RCT (?, 0, 0) = 0.
Let us consider the action AC. The precondition of action AC is a
partial variable assignment (A, ?, 0), meaning that this action is applicable
in two states, (A, 0, 0) and (A, 1, 0). There are two effects for this action,
(C, ?, 1) and (C, ?, 0), meaning that its outcome has two possible destination
states. Using the transition function we see that
f(A,0,0),AC(C, ?, 1) = (C, 0, 1) and
f(A,0,0),AC(C, ?, 0) = (C, 0, 0).
The chance of the (C, ?, 1) transition/outcome is 0.4 and the reward gained
is RAC(C, ?, 1) = 0.
See figure 2.1 for an illustration of all possible outcomes of the AC action.
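To make the formalism concrete, here is a small Python sketch (illustrative only, not part of the thesis; all names are ours) encoding the tireworld PSAS+ MDP of Example 1 together with the transition function fs,a:

```python
# Illustrative encoding of the tireworld PSAS+ MDP (Example 1).
# States are triples (location, has_spare, flat_tire); "?" marks an
# undefined variable in partial assignments.

# Each action: (precondition, {effect: probability}, {effect: reward})
ACTIONS = {
    "AB": (("A", "?", 0), {("B", "?", 1): 0.4, ("B", "?", 0): 0.6},
                          {("B", "?", 1): 0.0, ("B", "?", 0): 0.0}),
    "AC": (("A", "?", 0), {("C", "?", 1): 0.4, ("C", "?", 0): 0.6},
                          {("C", "?", 1): 0.0, ("C", "?", 0): 1.0}),
    "BC": (("B", "?", 0), {("C", "?", 1): 0.4, ("C", "?", 0): 0.6},
                          {("C", "?", 1): 0.0, ("C", "?", 0): 1.5}),
    "LT": (("B", 0, "?"), {("B", 1, "?"): 1.0}, {("B", 1, "?"): 0.0}),
    "CT": (("?", 1, 1),   {("?", 0, 0): 1.0},  {("?", 0, 0): 0.0}),
}

def agrees(partial, state):
    """A partial assignment agrees with a state if all defined values match."""
    return all(p == "?" or p == s for p, s in zip(partial, state))

def applicable(state):
    """The set A_s of actions applicable in state s."""
    return [a for a, (pre, _, _) in ACTIONS.items() if agrees(pre, state)]

def f(state, effect):
    """Transition function f_{s,a}: variables not set by the effect keep
    their value in s."""
    return tuple(s if e == "?" else e for s, e in zip(state, effect))

s0 = ("A", 0, 0)
print(applicable(s0))          # ['AB', 'AC']
print(f(s0, ("C", "?", 1)))    # ('C', 0, 1), as in the AC example above
```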
Our system evolves in stages, where a choice of the agent marks the
transition from one stage, t, to the next stage, t + 1. This is analogous to
the passage of time. The problem facing the decision maker is to select an
action to be performed at each stage.
Our objective is to maximize the total expected reward associated with
the course of actions, and we would like to evaluate the agent’s performance
over an unlimited number of stages. In this case the total reward may be
unbounded, meaning that any infinite sequence of effects could be arbitrarily
Figure 2.1: All instantiations of action AC in the tireworld example. The labels on the transitions are the effect (partial variable assignment), its probability, and its reward, in that order.
good or bad if it is executed for long enough. In this case it may be necessary
to adopt a different means of evaluation. The most common practice in this
respect is to introduce a discount factor. The discount factor ensures that
rewards gained at later stages are worth less than those gained at earlier
stages. Similarly to Bellman [2], we will define the expected total discounted reward value function:

V(seq) = \sum_{t=0}^{\infty} \gamma^t R(eff_t)

where seq = (eff_1, eff_2, ...) is a sequence of effects to occur and γ is a fixed
discount factor (0 < γ < 1). This formulation is a particularly simple and
elegant way to ensure a bounded measure of value over an infinite number
of stages.
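As a quick numeric illustration of the discounting (the effect sequence below is hypothetical):

```python
# Discounted value of a finite prefix of an effect sequence, gamma = 0.9.
# With rewards bounded by R_max, the infinite sum is bounded by R_max / (1 - gamma).
gamma = 0.9
rewards = [0.0, 0.0, 1.5]   # e.g. travel to B (0), load tire (0), reach C from B (1.5)
value = sum(gamma**t * r for t, r in enumerate(rewards))
print(value)                # 0.9**2 * 1.5 = 1.215
```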
Note that in classical planning, a solution corresponds to a sequence
of actions. This is not possible in the DTP case because the outcome of
performing an action is not known. Thus, the solution to our problems will
be a function from all states to actions.
Definition 4 Policy
A policy is a function π : S → A. If the current state is s, the agent is
prescribed by the policy to perform the action π(s).
Putting things together, in what follows we address the problem of building
a policy that maximizes the discounted sum of expected rewards over an
infinite number of stages. It is known that there always exists an optimal
policy for such problems [19]. Intuitively, we can see that this is the case
because no matter what stage the process is in, an infinite number of stages
remain. Thus the optimal action at any state is independent of the stage. In
the case of an infinite horizon, Howard showed [19] that the value function
of any policy π satisfies the following recurrence:
Definition 5 Value function of policy π
A value function V π : S → R for policy π is defined recursively:
V^\pi(s) = \sum_{e \in E} \mathcal{P}(e) \big( R_a(e) + \gamma V^\pi(f_{s,a}(e)) \big)
where π(s) = a, E is a set of effects, P is the probability distribution on
effects of that action a, and 0 < γ < 1.
The optimal value function satisfies a very similar recurrence:
Definition 6 Optimal value function
V^*(s) = \max_{a \in A_s} \sum_{e \in E} \mathcal{P}(e) \big( R_a(e) + \gamma V^*(f_{s,a}(e)) \big)
We would like the agent to adopt a policy that either maximizes this expected value or, in a satisficing context, guarantees an acceptably high expected value.
2.1 Value iteration algorithm
Bellman showed [2] that the value of a fixed policy π can be evaluated using successive approximations. This method follows directly from the recursive value function. His algorithm begins with an arbitrary assignment of values to V π0 (s), after which it calculates the next step using the following recurrence:
V^\pi_{t+1}(s) = \sum_{e \in E} \mathcal{P}(e) \big( R_a(e) + \gamma V^\pi_t(f_{s,a}(e)) \big).
The sequence of functions V πt converges linearly to the true value function
V π.
This is called the value iteration algorithm. This algorithm can also
be altered slightly so that it builds optimal policies. The optimal version of
Bellman’s algorithm starts with a value function V0 that assigns an arbitrary
value to each s ∈ S. Given value estimate Vt(s) for each state s, Vt+1(s) is
calculated as:
V_{t+1}(s) = \max_{a \in A_s} \sum_{e \in E} \mathcal{P}(e) \big( R_a(e) + \gamma V_t(f_{s,a}(e)) \big).
The sequence of functions Vt converges linearly to the optimal value function
V ∗(s). After some finite number of iterations n, the choice of maximizing
action for each s forms an optimal policy π, and Vn approximates its value.
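For concreteness, the optimal variant can be sketched in a few lines of Python (illustrative code, not from the thesis; the explicit model format model[s][a] = list of (probability, reward, successor) triples is our assumption):

```python
# A minimal value-iteration sketch over an explicit MDP. It mirrors the
# update V_{t+1}(s) = max_a sum_e P(e)(R_a(e) + gamma * V_t(f_{s,a}(e))).

def value_iteration(model, gamma=0.9, eps=1e-6):
    V = {s: 0.0 for s in model}                      # arbitrary initial values V_0
    while True:
        delta, V_new, policy = 0.0, {}, {}
        for s, actions in model.items():
            if not actions:                          # dead end: A_s is empty
                V_new[s] = 0.0
                continue
            q = {a: sum(p * (r + gamma * V[t]) for p, r, t in outcomes)
                 for a, outcomes in actions.items()}
            best = max(q, key=q.get)                 # maximizing action for s
            V_new[s], policy[s] = q[best], best
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < eps:                              # converged to (approx.) V*
            return V, policy
```

On an explicit encoding of the tireworld model, this converges to the values of Figure 2.2, e.g. V ∗((A, 0, 0)) ≈ 0.748.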
Example 2 Figure 2.2 illustrates the application of the value iteration al-
gorithm on the tireworld MDP from Example 1. We used γ = 0.9 as our
discount factor. As described in the algorithm above, Vi is a value function over the set of states; this function converges to V ∗. In this example we chose the initial value of all states to be 0. πi is the policy which maximizes the value, and πi(s) is the maximizing action of the recurrence above at stage i. Some states don't have an optimal action because they are dead ends, i.e., As = ∅. Each line of the table corresponds to one state.
2.2 (Modified) Policy iteration algorithm
As an alternative to value iteration, Howard [19] introduced the policy-
iteration algorithm. Rather than iteratively improving the estimated value
function, the new algorithm modifies the policies directly. It begins with an
arbitrary policy π0, then iterates, computing πi+1 from πi. Each iteration of
the algorithm comprises two steps, policy evaluation and policy improvement:
1. (Policy evaluation) For each s ∈ S, compute the value function V πi(s)
based on the current policy πi.
2. (Policy improvement) For each s ∈ S, find the action a∗ that maximizes
State    V0  π0  V1   π1  V2   π2  V3    π3  V4    π4  V5    π5  V6 = V ∗
(A,0,0)  0   AC  0.6  AC  0.6  AC  0.6   AB  0.748 AB  0.748 AB  0.748
(A,0,1)  0   -   0    -   0    -   0     -   0     -   0     -   0
(A,1,0)  0   AC  0.6  AC  0.6  AB  0.777 AB  0.777 AB  0.777 AB  0.777
(A,1,1)  0   CT  0    CT  0.54 CT  0.54  CT  0.54  CT  0.673 CT  0.673
(B,0,0)  0   BC  0.9  BC  0.9  BC  0.9   BC  0.9   BC  0.9   BC  0.9
(B,0,1)  0   LT  0    LT  0    LT  0.729 LT  0.729 LT  0.729 LT  0.729
(B,1,0)  0   BC  0.9  BC  0.9  BC  0.9   BC  0.9   BC  0.9   BC  0.9
(B,1,1)  0   CT  0    CT  0.81 CT  0.81  CT  0.81  CT  0.81  CT  0.81
(C,0,0)  0   -   0    -   0    -   0     -   0     -   0     -   0
(C,0,1)  0   -   0    -   0    -   0     -   0     -   0     -   0
(C,1,0)  0   -   0    -   0    -   0     -   0     -   0     -   0
(C,1,1)  0   CT  0    CT  0    CT  0     CT  0     CT  0     CT  0

Figure 2.2: Simulation of the value iteration algorithm on the running Tireworld example. V0 is a set of arbitrary initial values (not necessarily 0). Vi and πi are the iterating value and policy. V5 = V6 means the algorithm has converged and that V6 = V ∗.
Q_{i+1}(a, s) = \sum_{e \in E} \mathcal{P}_{s,a}(e) \big( R_a(e) + \gamma V^{\pi_i}(f_{s,a}(e)) \big).
If Qi+1(a∗, s) > V πi(s), then πi+1(s) = a∗; otherwise πi+1(s) = πi(s).
The algorithm iterates until πi+1(s) = πi(s) for all states s.
The policy evaluation phase can be carried out in different ways. The
standard approach is by solving a system of linear equations. Another ap-
proach is to compute some (usually small) number of iterations of successive
approximations (i.e., value iteration for fixed policy π). Then, during the
policy improvement phase, the algorithm updates the policy according to the
scheme above. A generalization of this algorithm is called modified policy
iteration (Puterman & Shin, 1978) [26]. Both value-iteration and policy-
iteration are special cases of modified policy iteration, corresponding to the
number of iterations t in the policy evaluation phase, with the algorithm
performing value iteration at t = 0 and policy iteration at t = ∞.
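A corresponding sketch of modified policy iteration, using the same assumed model format as the value-iteration sketch above; k is the number of partial policy-evaluation sweeps per iteration (with the greedy backup included in the improvement step, k = 0 behaves like value iteration, while exact evaluation recovers policy iteration):

```python
# Illustrative modified policy iteration over an explicit MDP:
# model[s][a] = list of (probability, reward, successor) triples.

def modified_policy_iteration(model, gamma=0.9, k=20, eps=1e-6):
    states = [s for s in model if model[s]]              # states with non-empty A_s
    policy = {s: next(iter(model[s])) for s in states}   # arbitrary initial policy pi_0
    V = {s: 0.0 for s in model}

    def backup(s, a, values):
        return sum(p * (r + gamma * values[t]) for p, r, t in model[s][a])

    while True:
        for _ in range(k):                               # partial policy evaluation
            V = {s: backup(s, policy[s], V) if s in policy else 0.0
                 for s in model}
        changed = False                                  # policy improvement
        for s in states:
            q = {a: backup(s, a, V) for a in model[s]}
            best = max(q, key=q.get)
            if q[best] > q[policy[s]] + eps:
                policy[s], changed = best, True
            V[s] = q[best]                               # one greedy Bellman backup
        if not changed:
            return policy, V
```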
2.3 Planning for Factored MDPs
The field of MDPs was formalized by Bellman [2] in the 1950’s. The im-
portance of solving large MDPs was recognized by Bellman when he sug-
gested value function approximation [3]. Within the AI community, value
function approximation developed concomitantly with the notion of value
function representations for Markov chains. Sutton’s seminal paper on tem-
poral difference learning [31], which addressed the use of value functions for
prediction but not planning, assumed a very general representation of the
value function and noted the connection to general function approximators
such as neural networks. Several important developments gave the AI com-
munity deeper insight into the relationship between function approximation
and dynamic programming.
Tsitsiklis et al. [35] and, independently, Gordon [15] popularized the
analysis of approximate MDP methods via the contraction properties of
the dynamic programming operator and function approximator. Tsitsiklis
established a general convergence result for linear value function approx-
imators and TD(λ), and Bertsekas and Tsitsiklis [5] unified a large body
of work on approximate dynamic programming under the name of Neuro-
dynamic Programming, also providing many novel and general error analy-
ses. Approximate linear programming for MDPs using linear value function
approximation was introduced by Schweitzer and Seidmann [29].
Tatman and Shachter [33] considered a factored approach of additive de-
composition of value nodes in influence diagrams. A number of approaches
to factoring of general MDPs have been explored in the literature. The use of
factored representations such as dynamic Bayesian networks was pioneered
by Boutilier et al. [9] and has developed steadily in recent years. These
methods rely on the use of context-specific structures such as decision trees
or algebraic decision diagrams (ADDs) (Hoey et al., [18]) to represent both
the transition dynamics of the DBN and the value function. The algorithms
use dynamic programming to partition the state space, representing the par-
tition using a tree-like structure that branches on state variables and assigns
values at the leaves. The tree is grown dynamically as part of the dynamic
programming process and the algorithm creates new leaves as needed: A
leaf is split by the application of a DP operator when two states associ-
ated with that leaf turn out to have different values in the backprojected
value function. This process can also be interpreted as a form of model
minimization (Dean & Givan, [13]). The number of leaves in a tree used to
represent a value function determines the computational complexity of the
algorithm. It also limits the number of distinct values that can be assigned
to states: since the leaves represent a partitioning of the state space, every
state maps to exactly one leaf. However, as was recognized early on, there
are trivial MDPs which require exponentially large value functions. This ob-
servation led to a line of approximation algorithms aimed at limiting the tree
size (Boutilier & Dearden, [8]) and, later, limiting the ADD size (St-Aubin,
Hoey, & Boutilier, [30]). Kim and Dean (2001) also explored techniques for
discovering tree-structured value functions for factored MDPs. While these
methods permit good approximate solutions to some large MDPs, their com-
plexity is still determined by the number of leaves in the representation and
the number of distinct values that can be assigned to states is still limited
as well. Tadepalli and Ok [32] were the first to apply linear value function
approximation to Factored MDPs. Linear value function approximation is
a potentially more expressive approximation method because it can assign
unique values to every state in an MDP without requiring storage space that
is exponential in the number of state variables. Schuurmans and Patrascu [28], building on the earlier work of Guestrin et al. on max-norm projection using cost networks and linear programs, independently developed an alternative approach to approximate linear programming using a cost network. Later, Guestrin et al. [16] embedded a cost network inside a single linear program. By contrast, the method of Schuurmans and Patrascu is based on a constraint generation approach, using a cost network to detect constraint violations. When constraint violations are found, a new constraint is added, repeatedly generating and attempting to solve LPs until a feasible solution is found.
Our approach is different: instead of approximating the value function
on the original, large MDP as previous works do, we propose calculating the
exact value function of an approximated MDP, which we construct using all
the state variables of the large original MDP. The state space of the resulting
approximated MDP is explicitly upper-bounded by a predefined parameter,
in order to enable the standard MDP algorithms to solve it optimally.
2.4 Deterministic planning
The value iteration and (modified) policy iteration algorithms for solving
MDPs both require explicit enumeration of the underlying state space. How-
ever, our state space grows exponentially with the number of state variables,
making the above algorithms virtually useless. Deterministic planning usu-
ally has the same property of exponential state space. A deterministic plan-
ning problem is usually given by variable based representation of the state
space of some system, initial state, and set of goal states. Each state defines
its own feasible set of actions. Actions translate the system from one state
to another, always assuming a single outcome for each action. A solution
to a deterministic planning problem is a sequence of actions that can be
performed to translate the system from the initial state into one of the goal
states. An optimal solution minimizes the total cost of the actions along the
goal-achieving action sequence.
Definition 7 Deterministic planning task
A deterministic planning task is a tuple Π = 〈V,A, s0, s∗〉 with the
following components:
1. V is a set of state variables.
2. A is a set of actions, such that each action a is a pair 〈pre,eff〉 where
both are partial variable assignments.
3. s0 ∈ S is an initial state.
4. s∗ is a partial variable assignment, such that a state s is a goal state
if it agrees with s∗.
Solutions to planning problems are paths from the initial state to a goal
state in the transition graph.
Definition 8 Transition graph
A transition graph is a tuple T = 〈S,L,A, s0, S∗〉 where
1. S is a finite set of states.
2. L is a finite set of transition labels.
3. A ⊆ S × L× S is a set of (labeled) transitions also called edges.
4. S∗ is a set of goal states.
A path from s0 to a goal state in S∗ is a plan. A plan is optimal iff its length is minimal.
One way to build heuristics that more efficiently estimate the distance
to a goal state is to use optimal solutions to a relaxation of the problem,
which is easier to solve than the original.
Problem relaxation may mean ignoring some of the problem constraints.
Helmert, Haslum & Hoffmann [17] showed a general algorithm for creating
consistent heuristics for deterministic planning. Their heuristic is the op-
timal cost of the solution to an abstraction generated by generalizing the
“merge-and-shrink” abstraction technique introduced by Drager, Finkbeiner
& Podelski [20].
A general abstraction is a mapping that reduces the size of the state
space by "collapsing" several states into one abstract state. Projections
are a form of abstraction heuristics that ignore completely all but a subset
of the state variables of the planning task. States that do not differ on the
chosen variables are "collapsed" together in the abstract space. The merge-
and-shrink technique allows composition of general state abstractions, which
include but are not limited to projections.
Definition 9 An abstraction of a transition graph T = 〈S,L,A, s0, S∗〉 is
a pair 〈T ′, α〉, where T ′ = 〈S′, L′, A′, s′0, S′∗〉 and α : S → S′, such that:
1. L′ = L.
2. 〈α(s), e, α(s′)〉 ∈ A′ for all 〈s, e, s′〉 ∈ A.
3. s′0 = α(s0).
4. α(s∗) ∈ S′∗ for all s∗ ∈ S∗.
Definition 10 Synchronized product - composition of abstractions.
Let there be two abstractions of transition graphs T ′ = 〈〈S′, L,A′, s′0, S′∗〉, α′〉 and T ′′ = 〈〈S′′, L,A′′, s′′0, S′′∗〉, α′′〉. The synchronized product T ′ ⊗ T ′′ is
T = 〈〈S,L,A, s0, S∗〉, α〉, where
1. S = S′ × S′′.
2. 〈(s′, s′′), e, (t′, t′′)〉 ∈ A iff 〈s′, e, t′〉 ∈ A′, 〈s′′, e, t′′〉 ∈ A′′.
3. s0 = (s′0, s′′0).
4. α(s) = (α′(s), α′′(s)).
5. (s′∗, s′′∗) ∈ S∗ for every s′∗ ∈ S′∗ and s′′∗ ∈ S′′∗ .
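The construction can be sketched directly in code. The following illustrative Python (the tuple representation of graphs and mappings is our assumption, not from the thesis) builds the synchronized product of two abstractions of transition graphs:

```python
# Synchronized product of two abstract transition graphs (Definition 10).
# A graph is a tuple (states, edges, s0, goals), edges are (s, label, t)
# triples, and alpha1 / alpha2 are the abstraction mappings.

def synchronized_product(G1, G2, alpha1, alpha2):
    S1, E1, s01, goals1 = G1
    S2, E2, s02, goals2 = G2
    states = {(u, v) for u in S1 for v in S2}
    # transitions synchronize on a shared label e
    edges = {((u1, u2), e, (t1, t2))
             for (u1, e, t1) in E1 for (u2, e2, t2) in E2 if e == e2}
    s0 = (s01, s02)
    goals = {(g1, g2) for g1 in goals1 for g2 in goals2}
    alpha = lambda s: (alpha1(s), alpha2(s))   # combined abstraction mapping
    return (states, edges, s0, goals), alpha
```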
Helmert et al. [17] formalized projections and abstractions for SAS+
representation of planning tasks. They proved that the composition of orthog-
onal projections of all variables is isomorphic to the transition graph of the
entire planning task. Their generalized model may produce as a special case
projections of the planning task but this model also allows composition of
partial abstractions. The greater flexibility offered by not restricting ab-
stractions to projections has pros and cons. While very accurate and relevant heuristics can be generated from small abstractions with this approach, the problem of selecting a good abstraction from a huge number of candidates becomes even harder.
The algorithm of Helmert et al. computes abstractions. It maintains
a pool of (orthogonal) abstractions, which initially consists of all atomic
projections. Atomic projections are projections that ignore all but one state
variable. Starting with this pool, the algorithm performs one of two possible
operations until a single abstraction remains: it merges (i.e., composes) two
abstractions by replacing them with their synchronized product (similar to
the product of automata) or it shrinks an abstraction by replacing it with
a homomorphism of itself ("collapsing" several states into one).
To keep time and space requirements under control, the algorithm enforces an explicit
limit on the size of the computed abstractions, which is specified as an
input parameter N . Before a product of two abstractions is computed, the
algorithm shrinks one or both of the abstractions such that their product
will not exceed N . Assuming that N is polynomially bounded by the input
size, and that the abstraction and merging strategy are computed efficiently,
the algorithm requires only polynomial time and space.
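The overall loop can be outlined as follows (an illustrative sketch; pick, shrink_one and product stand for the merging order, shrinking strategy and synchronized product discussed here, and graphs are (states, edges, s0, ...) tuples as in the sketch above):

```python
# Generic merge-and-shrink loop with an explicit size bound N.

def merge_and_shrink(pool, N, pick, shrink_one, product):
    """pool: atomic projections; pick: merge-order strategy returning an
    index; shrink_one: combines two abstract states of an abstraction;
    product: synchronized product of two abstractions."""
    current = pool.pop(pick(pool))                 # linear strategy: one composite
    while pool:
        nxt = pool.pop(pick(pool))
        while len(current[0]) * len(nxt[0]) > N:   # keep the product within N
            current = shrink_one(current)
        current, _ = product(current, nxt)         # merge step
    return current
```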
The merging strategy suggested and evaluated by Helmert et al. in [17] is
called linear merging. It maintains a single current abstraction, which is
initially an atomic projection. In each iteration
this abstraction is merged with a different atomic projection. The strategy
defines an order in which projections are merged using two rules, both of
which are based on a causal graph, which is a directed graph derived from
a planning task. The nodes of this graph are variables and the semantics
of an edge from variable v to v′ is that some value of v is required to cause
some changes to v′. The first rule is to choose, if possible, a variable from
which there is an edge in the causal graph to one of the previously added
variables. The second rule, applied when the first one cannot be, is to add
a variable for which a goal value is defined.
The shrinking strategy is used to keep the size of the synchronized prod-
uct of two abstractions A⊗ πv below the bound N . Because the algorithm
uses a linear merging strategy, the current abstraction A will always be
shrunk. The current abstraction size should be set to N/size(πv). The current abstraction may be shrunk to this size by a sequence of combinations, each combining two abstract states s and s′ into one state {s, s′}.
Combining pairs of states at random may create non-existing shortcuts, e.g.,
when state s, which is close to an initial state, is combined with state s′,
which is close to a goal state. Thus, Helmert et al. use the following shrink-
ing strategy in their algorithm: they define the h-value of an abstract state
s to be the length of a shortest path from s to the closest abstract goal
state, in the abstract transition graph. Similarly, the g-value of s is defined
to be the length of a shortest path from the abstract initial state to s, and
the f-value is the sum of both. Helmert’s algorithm tries to preserve h and
g values because the f-value of an abstract state is a lower bound on the
f-value associated with the corresponding node in A∗ search. The A∗ al-
gorithm expands all search nodes n with f(n) < L∗ and no search node
with f(n) > L∗, where L∗ is the optimal solution length for the task. Thus,
abstract states with high f-values are expected to be encountered less often
during search. Therefore, combining them is less likely to lead to a loss of
important information.
In keeping with the above intuition, the algorithm of Helmert et al.
uses the following shrinking strategy. First, it partitions all abstract states
into buckets. Two states are placed in the same bucket iff their g and h
values are identical. We say that bucket B is more important than bucket
B′ iff the states in B have a lower f-value than the states in B′. For tie
breaking, the algorithm uses a higher h-value. The algorithm selects the least important bucket and combines two of its states, chosen uniformly at random. If all buckets contain exactly one state, it instead combines the states of the two least important buckets.
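A sketch of one step of this bucket-based shrinking strategy, assuming precomputed g and h values per abstract state and a combine function that merges two states (the tie-breaking direction and the singleton fallback are simplifications of the description above, and at least two abstract states are assumed to remain):

```python
import random

def shrink_step(states, g, h, combine):
    buckets = {}
    for s in states:                      # bucket states by identical (g, h)
        buckets.setdefault((g[s], h[s]), []).append(s)
    # least important bucket: highest f = g + h, ties broken by higher h
    key = max(buckets, key=lambda gh: (gh[0] + gh[1], gh[1]))
    if len(buckets[key]) >= 2:
        a, b = random.sample(buckets[key], 2)   # two states, uniformly at random
    else:
        # fallback: combine the states of the two least important buckets
        k1, k2 = sorted(buckets, key=lambda gh: (gh[0] + gh[1], gh[1]))[-2:]
        a, b = buckets[k1][0], buckets[k2][0]
    return combine(a, b)
```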
The algorithm of Helmert et al. has been evaluated empirically against
state-of-the-art optimal planners [17, 22]. The domains were taken from the
International Planning Competition (IPC) from different years [17, 22]. The
results show that this approach is more than competitive with the current
state-of-the-art cost-optimal planners.
Chapter 3
Merge-and-Shrink
Compression of MDPs
Value iteration and policy iteration algorithms compute the optimal decision
an agent has to make for every state to obtain the highest expected value.
Those algorithms require explicit enumeration of the underlying MDP state
space. In structured MDPs, state space grows exponentially with the num-
ber of variables; for as few as 40 Boolean variables we will have to store 2^40
values in order to evaluate the value function with the algorithms described
above. This does not include the time complexity, which directly depends on
the number of transitions—limited in our case by k|A||S|, where k = O(1)
is an upper bound on the number of effects per action.
Another approach is thus required to solve structured MDPs with vari-
able based representation. Over the last decade, approaches in that direction
have been proposed. Factored MDPs (Boutilier et al. [10]) use implicit vari-
able based state space. A dynamic Bayesian network (DBN) can then allow
a compact representation of the transition model. Some works have ap-
proximated the solution using an approximate value function with compact
representation. A common choice is to use the linear value function as an
approximation value function that is a linear combination of possibly non-
linear basis functions (Bellman et al. 1963 [3]; Sutton, 1988 [31]; Tsitsiklis
et al. 1997 [35]).
Guestrin et al. 2003 [16] proposed an interesting variant of that ap-
proach. Built on the idea of Koller & Parr (1999, 2000) [23, 24], this
approach uses factored (linear) functions, where each basis function is re-
stricted to some small subset of the state variables. Those functions are
assumed to be part of the input, together with the problem definition. They
proposed two algorithms for solving such MDPs and introduced a novel lin-
ear programming decomposition technique used by both. This technique
reduces a structured linear programming problem with exponentially many
constraints to equivalent, polynomially-sized ones.
Dean & Givan 1997 [13] offered a somewhat different approach that
is based on minimizing the model much the same way that a finite state
machine is reduced to its equivalent minimal finite state machine. Their
algorithm takes as input an implicit MDP model in factored form and tries to
produce an explicit, reduced model whose size is within a polynomial factor
of the size of the factored representation. The algorithm cannot guarantee
that the model’s size will always be reduced to the required size, but an
optimal solution to the reduced model is the same as for the original model.
The aforementioned works try to solve factored MDPs by approximating
the value function on the basis of various assumptions about its structure.
Dean & Givan 1997 [13] reduce the MDP model so it can be evaluated
optimally, but they cannot guarantee success. Our approach, outlined in
what follows, takes a different course: we calculate the exact value function
of an approximated MDP, using all the state variables, in polynomial time.
In particular,
• we don’t assume any structure of the value function;
• we can abstract our model to any constant, thus allowing us to solve
the abstracted model optimally;
• we guarantee that the abstraction process is polynomial in the size of
the structured problem description;
• the approximated value function is guaranteed to be an admissible
estimate of the true value function.
Our approach adapts the merge-and-shrink abstraction idea to the se-
mantics of MDPs by
• providing semantics for merge and shrink operators for the MDP tran-
sition graph;
• generalizing the synchronized product operation;
• defining abstraction for stochastic transition graphs with rewards;
• providing meaningful strategies for merging and shrinking stochastic
transition graphs with rewards.
3.1 Stochastic transition graphs
The semantics of a PSAS+ MDP is given by mapping it to a stochastic
transition graph with rewards.
Definition 11 A stochastic transition graph with rewards is a tuple
〈T,R, P 〉 where
1. T = 〈S,L,A, s0〉 is a transition graph.
2. R : A→ R is a real-valued reward function from transitions.
3. P : A→ R is a real-valued function from transitions such that Ps(e) =
P (s, e, s′) is a probability function, where s is a state and e ∈ E, with E being the set of effects of one action a.
Note that we omit the set of goal states S∗ from the transition graph
because it is irrelevant for MDPs. We denote the stochastic transition graph
with rewards associated with a PSAS+ MDP Π = 〈V,A, s0,R〉 by T (Π) =
〈T,R, P 〉, where T = 〈S,L,A, s0〉 such that:
1. S = S, i.e., the states of the graph are the states of the MDP.
2. L = {e|〈pre, E〉 ∈ A, e ∈ E} is the set of all effect labels. Similar effects
of distinct actions will have different effect labels.
3. 〈s, e, s′〉 ∈ A iff e ∈ E, 〈pre, E,P〉 ∈ A, pre agrees with s and fs,a(e) =
s′.
4. P (〈s, e, s′〉) = P(e), such that 〈pre, E,P〉 ∈ A and e ∈ E.
5. R(〈s, e, s′〉) = R(e).
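As an illustration, this mapping can be realized by forward enumeration from s0; the sketch below assumes the action format and the agrees and f helpers from the earlier tireworld sketch (all hypothetical names):

```python
# Build the stochastic transition graph with rewards T(Pi) by forward
# enumeration. Effect labels are (action, effect) pairs, so similar effects
# of distinct actions get distinct labels.

def build_transition_graph(actions, s0, agrees, f):
    states, edges, P, R = {s0}, set(), {}, {}
    frontier = [s0]
    while frontier:
        s = frontier.pop()
        for name, (pre, probs, rewards) in actions.items():
            if not agrees(pre, s):        # action not applicable in s
                continue
            for e, p in probs.items():
                t = f(s, e)               # destination state of effect e
                edge = (s, (name, e), t)
                edges.add(edge)
                P[edge], R[edge] = p, rewards[e]
                if t not in states:
                    states.add(t)
                    frontier.append(t)
    return states, edges, s0, P, R
```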
Figure 3.1: Transition graph of the tireworld problem. State 〈A, 1, 0〉 means location = A, has spare = 1 and flat tire = 0. 〈A, 0, 0〉 is the initial state of the problem. An edge label AC1, 1, 0.6 refers to effect 1 of action AC with reward 1 and probability 0.6.
See Figure 3.1 for the transition graph of the tireworld problem.
Note that we can similarly denote a PSAS+ MDP associated with a
stochastic transition graph with rewards T by Π(T ).
Definition 12 Let there be a stochastic transition graph with rewards T .
The value of state s of T is the value of the same state s of PSAS+ MDP
Π(T ).
Note that one state can be a source state of two transitions which have
identical labels (effects of an action) but two different target states. We
exploit this flexibility later in our shrinking mechanism. In practice, the
number of transitions substantially dominates the number of states, and
thus maintaining transitions consumes most of the memory.
3.2 Abstractions
Abstractions of transition graphs are the core of our approach to construct-
ing a small approximation of our MDP. Abstraction implies that some in-
formation or some constraints will be ignored in order to obtain a smaller
representation of the transition graph. Formally, we define abstractions of
transition graphs as follows:
Definition 13 Abstraction.
An abstraction of a stochastic transition graph with rewards T = 〈T,R, P 〉 is a tuple TA = 〈T ′, R′, P ′, α〉, where α : S → S′ is a function called the abstraction mapping, such that:
1. T ′ is the abstraction of the transition graph T with different labeling:
〈α(s), es, α(s′)〉 ∈ A′ for all 〈s, e, s′〉 ∈ A;
2. R′(α(s), es, α(t)) = R(s, e, t);
3. P ′(α(s), es, α(t)) = P (s, e, t).
Definition 14 Abstract value function
Let Π be a PSAS+ task with state set S, and let TA = 〈T,R, P, α〉 be
an abstraction of its stochastic transition graph with rewards T (Π). The
abstract value function VT (s) is the function which assigns to each state
s ∈ S the value of α(s) of T.
When two states are merged and the same action is applicable in both, the action is treated as two different actions in the abstraction. The effects will have the same probabilities and rewards, but will be considered to belong to two different actions. See Figure 3.2 for an example. When both actions have the same source and all their effects lead to the same destination states accordingly, we can dispose of one of the actions; this will not change the value function.
Note that our notion of abstraction is transitive.
Our abstraction mechanism is based on projection abstractions, formally
defined as follows:
Definition 15 Projection
Let Π = 〈V,A, s0,R〉 be a PSAS+ MDP with state set S, and let V ⊆ V be a subset of its variables.
A homomorphism on a stochastic transition graph with rewards T defined by a mapping α such that α(s) = α(s′) iff s(v) = s′(v) for all v ∈ V is called the projection onto the variable set V, denoted by πV .
If V is a singleton set, π{v} is called an atomic projection, also denoted
by πv.
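As a small illustration, a projection mapping can be realized by restricting each state to the positions of the projected variables. The following Python sketch (with illustrative names; an assumption of ours, not thesis code) shows πlocation for the tireworld state encoding (location, has spare, flat tire):

    def projection_mapping(variables, V):
        # alpha for pi_V: alpha(s) = alpha(s') iff s and s' agree on all v in V.
        idx = [i for i, v in enumerate(variables) if v in V]
        return lambda state: tuple(state[i] for i in idx)

    # pi_location keeps only the first component of a tireworld state:
    alpha = projection_mapping(("location", "has spare", "flat tire"), {"location"})
    assert alpha(("A", 0, 0)) == alpha(("A", 1, 1))   # both map to ("A",)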
The abstractions on which we base our approximation are constructed by interleaving the composition of abstractions with further abstraction of the composites.
Figure 3.2: Abstraction of a transition graph, from left to right, where α(s1) = α(s2) = α(s3) and α(t1) = α(t2). Note that action a becomes two different actions a^{s1} and a^{s3}. Rewards and probabilities are preserved.
Composing here means extending, by means of probabilities and rewards, the standard operation of forming the synchronized product.
The extended synchronized product is defined as follows:
Definition 16 Synchronized product
Let there be two abstractions of stochastic transition graphs with rewards,
T′A = 〈T′, R′, P′, α′〉 and T″A = 〈T″, R″, P″, α″〉. The synchronized product of T′A and T″A is defined as T′A ⊗ T″A = 〈T, R, P, α〉, where α(s) = (α′(s), α″(s)) and
1. T = T ′ ⊗ T ′′;
2. R(〈(s′, s′′), e, (t′, t′′)〉) = R′(s′, e, t′);
3. P (〈(s′, s′′), e, (t′, t′′)〉) = P ′(s′, e, t′).
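The following Python sketch illustrates the synchronized product, assuming that both input graphs carry all effect labels (a label irrelevant to a projection appears there as self-loops) and that the inputs are orthogonal, so the rewards and probabilities stored in the two graphs coincide; the encoding and names are our assumptions.

    def synchronized_product(trans1, trans2):
        # trans1, trans2: lists of (source, label, target, prob, reward).
        # Paired states move together on a shared label; for orthogonal inputs
        # the probability and reward stored in both graphs coincide, so we can
        # take them from the first graph (items 2 and 3 of Definition 16).
        by_label = {}
        for (s2, e, t2, p2, r2) in trans2:
            by_label.setdefault(e, []).append((s2, t2))
        result = []
        for (s1, e, t1, p, r) in trans1:
            for (s2, t2) in by_label.get(e, []):
                result.append(((s1, s2), e, (t1, t2), p, r))
        return result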
Definition 17 Relevant variables, orthogonal abstractions
Let Π be a PSAS+ MDP with variable set V, and let TA be an abstraction
of T (Π). We say that TA depends on variable v ∈ V iff there exist states s
and s′ with α(s) ≠ α(s′) and s(v′) = s′(v′) for all v′ ∈ V \ {v}. The set of relevant variables for TA, denoted by varset(TA), is the set of variables in V on which TA depends.
Abstractions TA and T ′A are orthogonal iff varset(TA) ∩ varset(T ′A) = ∅.
Figure 3.3: Atomic projections of the tireworld problem onto the variable location (πlocation, left) and the variable flat tire (πflat tire, right).
Figure 3.4: Synchronized product of two atomic projections of the tireworld problem, πlocation ⊗ πflat tire.
The synchronized product of two orthogonal abstract stochastic transi-
tion graphs with rewards of T (Π) is an abstract stochastic transition graph
with rewards of T (Π). Clearly, projections satisfy varset(πV ) = V , so projec-
tions onto disjoint variable sets are orthogonal. Moreover, varset(TA⊗T ′A) =
varset(TA) ∪ varset(T ′A).
In particular, the synchronized product of all atomic projections {πv | v ∈ V} of a PSAS+ task Π is equal to the full transition graph T(Π).
3.3 Generic abstraction algorithm
Atomic projections and synchronized products can fully capture the state
transition semantics of PSAS+ MDP. However, for interesting problems we
cannot compute the products of all atomic projections, as the size of the
Generic algorithm compute-abstraction(Π, N):
    abs ← {πv | v ∈ V \ Vinit}
    TA ← πVinit
    while |abs| > 0 do
        Select refinement projection πv ∈ abs
        Contract TA until size(TA) · size(πv) ≤ N
        abs ← abs \ {πv}
        Refine: TA ← TA ⊗ πv
    end
    return TA

Algorithm 1: Algorithm for computing an abstraction for a PSAS+ MDP Π, with a bound N on the number of abstract states
transition graph grows exponentially in the number of atomic projections in
the synchronized product. When the graph becomes too large to be stored
in memory, we must shrink the created abstraction by replacing it with a
homomorphism of itself, in order to create an abstraction that includes all
the problem variables.
The goal of the algorithm is to bound the abstract transition graph to a
constant size. At the same time, values calculated on the abstract transition
graph should be as similar as possible to the original transition graph values
(values of original transition graphs are mapped to the abstract transition
graph using the α functions as described in the next section). This depends
in turn on the merging and shrinking strategies.
• The merging strategy corresponds to the order in which the cross-
product operations are carried out. We would like to choose those
variables which will help us contract states while decreasing the accu-
racy of our solution as little as possible.
• The shrinking strategy corresponds to choosing which states should be
combined (i.e., abstracted). We would like to choose states to contract
such that the solution of the abstraction will be as accurate as possible.
Example 3 Figure 3.1 depicts the transition graph of the tireworld PSAS+
MDP. Each label corresponds to one effect of the PSAS+ MDP. The edge
labels have 3 parts: the action name with the effect number, the reward of
Figure 3.5: Abstraction of TA = πlocation ⊗ πflat tire, first case.
that effect, and its probability of occurring. Figure 3.3 depicts abstract transition graphs of the original PSAS+ MDP: the atomic projections onto the variables flat tire and location.
Now we will simulate the generic algorithm. The algorithm inputs are
the tireworld problem MDP Π from Example 1 and N = 10 as the limit on
the number of states.
As the first abstraction we select TA = πlocation; see Figure 3.3. The two other atomic projections, πflat tire and πhas spare, are members of the abs set. Then we select πflat tire ∈ abs; we don't contract any states, because 3 × 3 ≤ 10. Next we remove πflat tire from abs and update TA by computing the synchronized product TA ⊗ πflat tire; the result is shown in Figure 3.4.
Now we select the last projection, πhas spare ∈ abs, which empties abs. But before we update TA, we must perform a contraction (because 6 × 2 > 10). We calculate an abstract transition graph using one of the following abstraction functions α:

1. First case: α(〈B, ?, 0〉) = α(〈B, ?, 1〉) = 〈B, ?, ?〉;
2. Second case: α(〈A, ?, 0〉) = α(〈C, ?, 1〉), i.e., the states 〈A, ?, 0〉 and 〈C, ?, 1〉 are merged into a single abstract state.
We obtain as a result the abstract graphs shown in Figures 3.5 and 3.6.
Now we update TA by computing the synchronized product with the last
projection πhas spare and obtain the final abstract transition graph of our
PSAS+ MDP, shown in Figures 3.7 and 3.8.
Figure 3.6: Abstraction of TA = πlocation ⊗ πflat tire, second case.
Figure 3.7: Final abstraction of the full tireworld problem, first case.
3.3.1 Using abstraction as a controller for MDP
What remains is to calculate the value function for the abstract transition
graph. This can be done using standard value calculation methods similar
to the ones in Example 2. Then, using the abstract value function, we can
approximate the value of any state of the original MDP.
This is done by storing all the α mapping functions we used to abstract
our transition graph. Each merge operation requires α, the abstraction
mapping function. We represent it as a table of size O(N). There are |V| variables and thus at most |V| merge operations. Thus the mapping can
be stored in O(N |V|) space. Every concrete state in the original MDP has
its mapping in the abstraction. The tables allow us to retrieve the value
Figure 3.8: Final abstraction of the full tireworld problem, second case.
of the abstract state for every concrete state mapped to it in O(|V|) time.
Furthermore, given a concrete state, we can estimate the best action in the following way: for each action applicable in the concrete state, we retrieve the values of all its successor states and calculate the expected reward. The action with the highest expected reward is considered best. A sketch of this controller appears below.
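The controller just described can be sketched as follows. We assume concrete states are dictionaries from variables to values and that each merge step stored one lookup table, already composed with that step's shrink mapping; the table format, the names, and the omission of discounting are our simplifying assumptions, not the thesis code.

    def abstract_state(concrete, merge_order, alpha_tables):
        # Fold the concrete state through one table per merge step; each table
        # maps (running abstract state, value of the newly merged variable) to
        # the next abstract state, composed with that step's shrink mapping.
        a = concrete[merge_order[0]]
        for v, table in zip(merge_order[1:], alpha_tables):
            a = table[(a, concrete[v])]
        return a                                     # O(|V|) lookups in total

    def best_action(state, applicable, merge_order, alpha_tables, value):
        # applicable(state) yields pairs (action, [(successor, prob, reward), ...]);
        # we score each action by its expected reward against the abstract values
        # (discounting omitted for brevity) and return the best one.
        def q(outcomes):
            return sum(p * (r + value[abstract_state(t, merge_order, alpha_tables)])
                       for (t, p, r) in outcomes)
        return max(applicable(state), key=lambda act_out: q(act_out[1]))[0]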
3.4 Merge and shrink strategies
The two key parameters of the general merge-and-shrink framework are the strategy determining which states are to be abstracted, and when, and the strategy determining the order in which the partial abstractions are combined. The general guidelines
for these two strategies in the case of MDPs are based on the following two
properties of our merge and shrink operators:
1. State contraction at a shrink step only extends the node-to-node stochastic reachability within the abstract transition system, and
2. Stochastic paths eliminated from the abstract transition system at
merge correspond to shortcuts that are not present in the original
transition system.
These two properties are exemplified in Examples 4 and 5.
Example 4 In this example we assume that all transitions correspond to
different actions, each having a single effect. Contraction of states s3 and
s4, resulting in replacing them with a new abstract state s′, creates a new
path in the transition system from s1 to s5. In particular, that means that
the value of s′ will be the maximum of the original values of s3 and s4, and the values of s1 and s2 can, as a result of this contraction, only increase.
[Diagram: on the left, the original transition system over the states s1, . . . , s6; on the right, the system after contracting s3 and s4 into s′, which creates a new path from s1 to s5.]
Example 5 In the tireworld Example 1, consider the projection on the
location variable (Figure 3.3, left side). Consider now the state (A, ?, ?),
abstracting four original states, and the edge from (A, ?, ?) to (B, ?, ?) la-
beled with AB (that is, with the action “move from A to B”). Note that this
action is inapplicable in the state (A, 0, 1) because one cannot drive with a
flat tire. Thus, this edge corresponds to a shortcut with respect to the origi-
nal transition system. After the synchronized product of this projection and
the atomic projection πflat tire, we distinguish between the states (A, ?, 1)
and (A, ?, 0) (see Figure 3.4), and the edge labeled with AB is now outgoing
only from (A, ?, 0).
The two aforementioned properties of our merge and shrink operators
ensure that, at each step of the merge and shrink processes, the values of
states in the created abstractions always upper-bound (or over-approximate)
the values of their prototype states in the original problem. That is, for
each state s of our MDP, and each intermediate abstraction T created by
our merge and shrink operators, we have VT(α(s)) ≥ V*(s). A construc-
tive conclusion from that is that the merge and shrink operators should be
chosen with the goal of lowering the abstract value function. Likewise, if
MDP solving is devoted to a specific initial state, then, at any stage of the
merge-and-shrink process, we can safely eliminate all the components of the
abstract transition systems that are not reachable from the abstract initial
state.
3.4.1 Shrink strategies
Given a pair of an abstract transition system TA and an atomic projection πv to be merged, the purpose of the shrinking stage is to reduce the size of TA so that the result of the synchronized product between TA and πv will
fit the memory bound, expressed via a fixed bound N on the number of abstract states. Hence, the process will involve n = |TA| − ⌊N/|πv|⌋ contractions
of some pairs of states. The high-level objective for the shrinking process
explained above brought us to suggest and evaluate two simple strategies for
selecting pairs of states for contraction, called below minimum-value con-
traction and minimum-difference contraction. As a baseline for evaluation
we used random shrinking in which pairs of states for contraction are se-
lected purely at random; both our not random strategies outperformed the
random shrinking by a vast margin.
1. Minimum value contraction. The rationale behind this strategy is
related to the “high-f-values-first” strategy of Helmert et al. for deter-
ministic planning. As our abstractions consistently over-approximate
the state values in the original system, abstract states with (relatively)
low values abstract the states of the original system that are not likely
to be visited along the optimal policy. Hence, contracting states with
lower values corresponds to coarsening the abstractions in the regions
of the transition system that are less likely to be relevant at the ex-
ecution stage. The minimum-value strategy thus simply corresponds
to
(a) computing the value function of the abstraction TA, and
(b) contracting n pairs of states with the lowest values.
In our experiments, even though minimum-value contraction substan-
tially outperformed the random contraction baseline, it resulted in con-
tracting many pairs of states with (relatively low but) quite different
values. As combining states with different values implies propagation
of higher values to the predecessors of these states, such contractions
may substantially increase the values of some abstract states in the
system. This aspect of the shrinking process in MDPs brought us to
consider an alternative strategy.
2. Minimum difference contraction. The intuitive assumption be-
hind the minimum-difference strategy is that abstract states with sim-
ilar values are more likely to represent states with similar values in the
original system. In particular, combining abstract states with identi-
cal values will not increase our estimate of the value function. Given
that, the minimum-difference strategy comprises
(a) computing the value function of the abstraction TA,
(b) contracting n pairs of states s, s′ with the lowest difference |VTA(s) − VTA(s′)|, with ties broken towards the lower values of the value function.
As we expected, in our experiments this strategy outperformed not only the purely random baseline, but also the minimum-value strategy inspired by the decisions of Helmert et al. for deterministic planning. A sketch of both selection rules follows.
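The two selection rules can be sketched as follows, assuming values maps each abstract state to its current value VTA. The greedy, disjoint pairing used here is one simple reading of “contracting n pairs” and is our assumption rather than the thesis implementation.

    def minimum_value_pairs(values, n):
        # Pair off the 2n abstract states with the lowest current values
        # (assumes at least 2n states).
        ordered = sorted(values, key=values.get)
        return [(ordered[2 * i], ordered[2 * i + 1]) for i in range(n)]

    def minimum_difference_pairs(values, n):
        # Candidate pairs are adjacent states in value order, ranked by value
        # difference with ties broken toward lower values; pick n disjoint pairs.
        ordered = sorted(values, key=values.get)
        candidates = sorted(zip(ordered, ordered[1:]),
                            key=lambda p: (abs(values[p[0]] - values[p[1]]),
                                           values[p[0]]))
        chosen, used = [], set()
        for s, t in candidates:
            if s not in used and t not in used:
                chosen.append((s, t))
                used.update((s, t))
                if len(chosen) == n:
                    break
        return chosen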
3.4.2 Merge strategies
A merge step of the merge-and-shrink algorithm corresponds to replacing the
running abstraction TA and an atomic projection πv with the synchronized
product of the two. While the running abstraction is unique, πv should
be selected among the different atomic projections, and this choice may
dramatically affect the quality of the final abstraction. In these terms,
the overall merge-and-shrink algorithm can be seen as parametrized with
a merging strategy, corresponding to the order in which the atomic projec-
tions are merged with the incrementally constructed running abstraction.
Here as well, the high-level expectations from a successful merge operation
brought us to suggest and evaluate two simple merging strategies, called
below variable lookahead and symmetric variable refinement. As a baseline
for evaluation we used random selection of the atomic projections which, as
expected, leads to very poor abstractions of the original MDPs.
1. Variable lookahead. This strategy performs a simple myopic optimization of the choice of atomic projection: the chosen projection yields the best (that is, the lowest over-approximated) value of the initial state in the resulting synchronized product.

(a) For each atomic projection πv ∈ abs, compute T^v_A = TA ⊗ πv.
(b) Select the atomic projection πv minimizing V_{T^v_A}(α(s0)), the value of the abstract initial state.

The minimization criterion in step (b) stems from the over-approximating nature of our abstractions. Note that, overall, this projection selection
strategy is computationally quite involved, as it requires Θ(|V|) computations of the synchronized product and the value function.
2. Symmetric variable refinement
Consider the boxworld domain with variables and their domains denoted
as follows.
(a) V = {box1, ..., boxk, truck1, ..., truckl, plane1, ..., planem}
(b) Dboxi = {city1, ..., cityn, truck1, ..., truckl, plane1, ..., planem}
(c) Dtrucki = Dplanej = {city1, ..., cityn}
(d) A = {load box i on truck j at city k, unload box i from truck j at city k, load box i on plane j at city k, unload box i from plane j at city k, drive truck i, fly plane j}
Note that each of those actions is instantiated for every box, city, plane and truck accordingly, e.g., load box 7 on truck 2 at city 4.
The goal is to move boxes from their initial locations (=cities) to their
destination locations. Each box can be loaded onto any truck or plane.
Trucks and planes move between cities. Reward is gained when a box
gets to its destination city. The box stays there and can never be moved.
Now we have to choose the order of all variables, boxes, planes and
trucks. But there is no a priori reason why we should choose one truck
variable trucki over the other truckj or why we should choose boxi over
boxj . In general, an abstract transition graph of this problem starts to
make sense when the variables of at least one box and one vehicle (truck or
plane) are in the abstraction. Thus any two box variables are considered symmetric; the same is true for the truck variables and for the plane variables. Therefore a logical order of variables for the linear merge strategy is a round robin over the groups of symmetric variables. For this domain, the order will be box1, truck1, plane1, box2, truck2, plane2, ...
To operationalize this intuition of symmetric variables we used a simple heuristic based on two counts per variable: the number of action preconditions in which the variable appears, and the number of action effects in which it appears. Any two variables having the same pair of counts are considered symmetric (see the sketch below).
In practice, this heuristic partitioned the variables into semantically meaningful equivalence groups in all domains. The resulting strategy was faster than, and similarly effective to, the aforementioned variable lookahead merging strategy.
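The following sketch illustrates the grouping heuristic and the resulting round-robin merge order; the action encoding (pairs of a precondition and a list of effects, each a partial assignment) is an illustrative assumption.

    from collections import defaultdict

    def symmetric_groups(variables, actions):
        # actions: iterable of (precondition, effects); precondition is a partial
        # assignment (dict), effects is a list of partial assignments.
        pre_count, eff_count = defaultdict(int), defaultdict(int)
        for pre, effects in actions:
            for v in pre:
                pre_count[v] += 1
            for eff in effects:
                for v in eff:
                    eff_count[v] += 1
        groups = defaultdict(list)
        for v in variables:      # same (precondition, effect) counts => symmetric
            groups[(pre_count[v], eff_count[v])].append(v)
        return list(groups.values())

    def round_robin_order(groups):
        # Merge order cycling through the groups: box1, truck1, plane1, box2, ...
        order, i = [], 0
        while any(i < len(g) for g in groups):
            order.extend(g[i] for g in groups if i < len(g))
            i += 1
        return order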
3.5 Edge abstraction
In stochastic planning, the edge density in the state model is substantially
higher than in deterministic, classical planning where at each state, each
action induces just a single edge, corresponding to the single effect of that
action. Empirically, one of the most problematic aspects of the merge-and-
shrink abstractions for MDPs is that, even if the number of states in the
abstract state space is reasonably bounded, abstracting a typical PSAS+
MDP usually yields a huge number of transitions, a few orders of magni-
tude more than the number of abstract states. This is problematic for two
reasons. First, the space complexity of the abstractions grows very quickly,
allowing us to work only with abstract state spaces having a relatively small
number of states. Second, the time complexity of computing value functions
for the intermediate abstractions grows at a similar rate as it is linear in the
number of transitions in the abstraction.
Considering this bottleneck of the number of edges in the abstract transi-
tion graphs, we aimed at identifying conditions under which we could safely
reduce the number of transitions. Below we describe our findings, and note
that all the conditions/properties described below are applicable, and thus
potentially useful, to deterministic planning as well.
Definition 18 Fixed action, effect and transition
Let TA be the current abstraction at an iteration of the merge-and-shrink algorithm with a
linear merge strategy on a PSAS+ task Π = 〈V,A, s0,R〉. Let V ′ ⊆ V be
the subset of all state variables v such that πv has already been merged into
TA. For each action a = 〈pre, E〉 ∈ A, and each effect e ∈ E,
1. e is called fixed if, for each variable v for which e is defined, v ∈ V ′.
2. a is called partially fixed if, for all e ∈ E, e is fixed.
3. a is called fixed if a is partially fixed and, for each variable v for which
pre is defined, v ∈ V ′.
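Definition 18 translates directly into code. The sketch below assumes partial assignments are represented as dictionaries keyed by variable; the names are illustrative only.

    def effect_is_fixed(effect, merged):
        # merged is V', the set of variables already merged into the abstraction.
        return all(v in merged for v in effect)

    def action_is_partially_fixed(action, merged):
        pre, effects = action
        return all(effect_is_fixed(e, merged) for e in effects)

    def action_is_fixed(action, merged):
        pre, effects = action
        return (action_is_partially_fixed(action, merged)
                and all(v in merged for v in pre))

    # In Example 6 below, with merged = {"x"}: action a is fixed, b is
    # partially fixed (its precondition mentions y), and c is not even
    # partially fixed (its effect mentions y).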
A simple and yet very helpful observation is that, as the process of merge
and shrink continues, all merges will keep duplicating each fixed transition
because all the variables involved in the preconditions and effects of the
respective actions are already represented in the current abstraction.
Example 6 Consider the following example over boolean variables x =
{x0, x1}, y = {y0, y1} and z = {z0, z1}, and three simple deterministic ac-
tions
1. a = 〈{x = x0}, {x = x1}〉
2. b = 〈{y = y0}, {x = x1}〉
3. c = 〈{x = x0}, {y = y0}〉
Let us simulate the merge-and-shrink process, starting from the projection πx onto the variable x as the current abstraction:
[Diagram: abstract transition system of πx, with states x0 and x1; transitions labeled a and b lead from x0 to x1, b also self-loops on x1, and c self-loops on x0.]
Notice that the transition labeled with a is fixed, because all of its pre-
condition and effect variables are in the current abstraction. Transitions
labeled b are not fixed because their precondition is not fixed but their effect
is fixed. Finally, the transition labeled c is not fixed, because its effect is not
fixed, though its precondition is fixed. Now, suppose that the next merge step
brings in the atomic abstraction πy:
[Diagram: abstract transition system of πx ⊗ πy over the states x0y0, x0y1, x1y0, x1y1; the fixed a-transitions are duplicated across both values of y, the b-transitions appear only where y = y0, and the c-transitions set y to y0.]
It is easy to see that all the fixed transitions labeled a are copied as is. The fixed effects labeled b are also copied, but only to the y = y0 part, because the precondition of b was not fixed. Observing that all the transitions are now
fixed, consider the outcome of the next merge with the atomic abstraction
πz:
[Diagram: abstract transition system of πx ⊗ πy ⊗ πz over the eight states xiyjzk; every transition is now fixed and is duplicated across both values of z.]
The example above illustrates that the evolution of fixed transitions
within the merge-and-shrink process is fully predictable, that is, we can
fully characterize their source and destination states after all future merge
and shrink steps. With care, this information can be used to eliminate or
merge transitions that are known not to influence the value function of the
final abstraction.
Coming back to the example above, consider the transitions from state x0y0z0 to state x1y0z0 labeled with a and b, keeping in mind that our example corresponds to a special case in which there is exactly one effect for each action.
Figure 3.9: Example of dominated actions: B = {b, c} dominates a.
Given that both these transitions are fixed, if they are associated with the same reward, we can safely discard one of them
because this will not change the value function. Following this intuition,
below we formalize a criterion for a safe elimination of transitions from the
abstraction.
Definition 19 Dominated actions
Let TA be a current abstraction at an iteration of the merge-and-shrink
process, s be a state in TA, a be an action, and {s1, . . . , sn} be a set of all
states s′ in TA such that there is a transition labeled with a from s to s′. We
say that a set of actions B dominates an action a on state s if and only if
1. all the transitions from s labeled with B are to {s1, . . . , sn}, and
2. for any possible value assignment to states {s1, . . . , sn}, the expected
value of the action a will be lower than that of at least one action from
B.
Generic algorithm compute-abstraction(Π, N):
    abs ← {πv | v ∈ V \ Vinit}
    TA ← πVinit
    while |abs| > 0 do
        Remove dominated actions from TA
        Select merging projection πv ∈ abs
        Shrink TA until size(TA) · size(πv) ≤ N
        abs ← abs \ {πv}
        Merge: TA ← TA ⊗ πv
    end
    return TA

Figure 3.10: Merge-and-shrink algorithm extended with dominated-action elimination.
Example 7 Dominated actions
Consider the abstraction depicted in Figure 3.9. The action set B =
{b, c} dominates the action a at state s because (i) if V (t1) > V (t2) then b
has a higher value than a, and (ii) if V (t1) < V (t2), then c has a higher
value than a. Hence, if all the aforementioned transitions are fixed, then
we can safely remove the transition induced by a without affecting the value
function.
Given the notion of action domination at a state, we can now enhance the algorithm by removing dominated actions. Specifically, if action a is dominated at state s by a set of actions B, and also (i) a is partially fixed and (ii) all actions in B are fixed, we can safely erase all transitions of action a from state s without affecting the value function at all. The extended algorithm appears in Figure 3.10; a sketch of the domination test for the two-successor case follows.
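For the two-successor, equal-reward case of Figure 3.9, the expected value of each action is linear in (V(t1), V(t2)), so it suffices to check the two extreme orderings of the successor values. The sketch below encodes this special-case test; the probability-pair encoding and the concrete numbers assumed for Figure 3.9 are our own illustration.

    def dominates(B, a):
        # B: probability pairs (p_t1, p_t2) of the candidate dominating actions;
        # a: the probability pair of the action to be eliminated. With equal
        # rewards, B dominates a iff some action in B puts at least as much
        # mass on t1 (covering the case V(t1) >= V(t2)) and some action in B
        # puts at least as much mass on t2 (covering V(t1) <= V(t2)).
        return (any(p1 >= a[0] for (p1, p2) in B)
                and any(p2 >= a[1] for (p1, p2) in B))

    # Figure 3.9, assuming a = (0.3, 0.7), b = (0.5, 0.5), c = (0.1, 0.9):
    assert dominates([(0.5, 0.5), (0.1, 0.9)], (0.3, 0.7))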
Chapter 4
Experiments
We have evaluated our framework on a set of benchmark domains from
the fully observable probabilistic track of the 2008 international planning
competition [11]. Most domains presented at the competition are planning-
like problems, which means that the reward is nonzero only for the single final action leading to the “goal” state. Such planning-like problems fall under the definition of MDP tasks, but are atypical of standard MDPs, which generally have a richer reward function with many opportunities to gain reward. We compare our abstraction algorithm to
other state-of-the-art planners that participated in that competition.
The competition consisted of seven domains, each with fifteen problem
instances. Those domains were presented in the PPDDL language [37]. Our
PSAS+ language is MVStrips-based, thus requiring us to convert the domains. A simple converter was written to translate each domain instance automatically.
We next describe the actions, using only general names for each. Each such action name can be instantiated for each different variable value. For example, the action pick up from table refers to all the different instantiations pick up boxi from table for every box variable i, and unload box on truck at city can be instantiated to unload boxi on truckj at cityk for all possible variable values i, j, k. Thus, the number of instantiated actions is polynomial in the size of the variable domains; in one domain it is exponential in the number of variables.
4.1 Domains
4.1.1 Blocksworld
Variables:
V = {b1, ..., bn, hand, clearb1, ..., clearbn, goal}
Dbi = {b1, ..., bn, on-table, on-hand}
Dhand = {b1, ..., bn, empty}
Dcleari = Dgoal = {0, 1}
Actions: pick up block from, pick up block from table, put block on, put block on table, pick up tower, put tower down on block, put tower down on table, goal.
This domain is similar to the classical Blocksworld domain. Besides the
standard actions of picking up blocks and putting them down, one could
pick up and put down a tower of two blocks. There is a chance that the
blocks will be scattered on the table. The likelihood of that is higher when
putting a tower down rather than a single block on another block or the
table. The actions incur a cost (a negative reward), and the only positive reward is received upon reaching one specific configuration of blocks. For our algorithm this domain turned out to be extremely hard due to its “classical planning-like” nature, induced by just a single reward at the very end.
4.1.2 Exploding blocksworld
Variables:
V = {b1, ..., bn, clearb1, ..., clearbn, detonatedb1, ..., detonatedbn, destroyedb1, ..., destroyedbn, hand, table ok, goal}
Dbi = {b1, ..., bn, on-table, on-hand}
Dcleari = Ddetonatedi = Ddestroyedi = {0, 1}
Dhand = {b1, ..., bn, empty}
Dtable ok = {0, 1}
Dgoal = {0, 1}
Actions: pick up block, pick up block from table, put down block on, put down block on table.
This domain is similar to the Blocksworld domain but without towers.
Actions might result in detonating a block. This will result in a dead end be-
cause positive rewards are not reachable. This is a very classical “planning-
like” domain, with one positive reward at the end and no negative rewards
at all.
4.1.3 Boxworld
Variables:
V = {box1, ..., boxk, truck1, ..., truckl, plane1, ..., planem}
Dboxi = {city1, ..., cityn, truck1, ..., truckl, plane1, ..., planem}
Dtrucki = Dplanej = {city1, ..., cityn}
Actions: load box on truck at city, unload box from truck at city, load box on plane at city, unload box from plane at city, drive truck, fly plane, goal.
Any box can be loaded to any truck or plane. Trucks and planes move
between cities. Trucks may arrive at the wrong city. Each time a box is
placed at its destination city, a reward is gained. At the end, when all boxes
have reached their destinations, an additional reward is gained. All rewards
are positive.
4.1.4 Triangle tireworld
Variables:
V = {location, flat tire, has spare, spare atloc1, ..., spare atlocm}
Dlocation = {1, ..., n}
Dflat tire = Dhas spare = Dspare atloci = {0, 1}
Actions: move car, load tire, change tire.
This domain is described in detail in Examples 1 and 2. In short, a car has to get from one location to another, moving one location at a time. The locations form a triangle-like map and some locations have spare tires; see Figure 3.1. There is a chance of getting a flat tire while driving. One positive reward is gained at the end. This problem is sensitive to the value-iteration discount factor γ; to solve it optimally we used γ = 0.995.
4.1.5 Rectangle tireworld
Variables:
V = {x, y, dead, goal}
Dx = Dy = {1, ..., n}
Ddead = Dgoal = {0, 1}
Actions: move car, move car diagonal, ghost teleport, goal.
This problem is similar to the triangle tireworld problem, but the loca-
tion is described as two variables x and y. Those locations form a rectangle-
like map. Additionally, some locations are not passable and some are deadly.
Driving diagonally is dangerous. The goal is to cross the map to the other
side. The rectangle-like problem has only four variables and is not interest-
ing (no abstraction is needed).
4.1.6 Schedule
Variables:
V = {C0, ..., Cn, C0 served, ..., Cn served, P0, ..., Pm, P0 dropped, ..., Pm dropped, Phase, Alive, Goal}
DCi = {P1, ..., Pm, None}
DPi = {U0, ..., Uk, Available}
DPhase = {C0, ..., Cn, P0, ..., Pm, Cleanup}
DCi served = DPi dropped = DAlive = DGoal = {0, 1}
Actions: process arrivals, time update, reclaim packet, packet serve, goal.
This is a very complex domain with 7 different groups of variables. The
goal is to serve all the classes (the Ci variables). A class is served if it serves a
packet (Pi variable). The problem simulates a state machine, such that the
state of the machine is determined by the Phase variable. The Phase vari-
able always changes in the same way: C0, C1, ..., P0, P1, ..., Cleanup,C0, C1, ...
and so on. In each round of the phase variable (state C0, ..., Cn of the state
machine), packets are assigned to classes stochastically. Usually there is a
small number of classes and a larger number of packets, and the system must
choose which packet to serve out of several. Thus, some packets have to wait
to be served (U0, ..., Uk are actually time limits after which the packets are
dropped). The time limit of the packets is updated at the P0, ..., Pm states.
The Cleanup state actually changes the Ci served variables, which lead to
the goal state. Only one class may serve one packet each phase round. All
packets may be dropped after some time. When a packet’s value is set to U0
it is reclaimed, and this operation may shut down the system. This
is a hard domain, with one positive reward at the end when all classes are
served and no negative rewards.
4.1.7 Search and rescue
Variables:
V = {at, explored1, ..., exploredn, landable1, ..., landablen, human-onboard, human-alive, human-rescued, on-ground, mission-ended}
Dat = {base, zone1, ..., zonen}
Dlandablei = Dexploredi = Dhuman-onboard = Dhuman-alive = Dhuman-rescued = Don-ground = Dmission-ended = {0, 1}
Actions: go to, explore, land, take off, end mission.
The goal of this problem is to rescue a person by choosing a good loca-
tion for a rescue helicopter to land. The problem is described by possible
locations where the helicopter may land, which are the domain values of the
at variable. The possible action is exploring those locations to check whether
they are landable. When the helicopter finds a place to land, the human is rescued and brought by the helicopter to the base. There is a chance that
the human will die on the way. This is a very symmetric and easy-to-solve
problem, with one positive reward at the end and some negative rewards if
the human dies on the way.
4.1.8 Sysadmin
This domain is a bit different from the others. A network of computers
has to be managed. They have a chance of crashing. At each phase, any
computer might fail, depending on network topology. This domain has a
small number of variables (the number of computers), but converting it to PSAS+ requires an exponential number of effects. We did not test this domain at all. Had the number of effects been small enough not to exceed the 4GB memory limit, we could have solved this domain optimally, because it has a small number of states (no abstraction is needed).
4.2 Setup and environment
We used an environment similar to that of the competition. Forty minutes
were given for each domain instance. This includes creating the mapping of
an instance and running simulations to evaluate performance. We used one
core of a Core(TM)2 Quad CPU Q8200 @2.33GHz computer with a 4GB
memory limit.
Two domains (Search and Rescue and Triangle Tireworld) were tested
with the MDPSim 2.2.2 used in the competition. We wrote an appropriate
planner to read our abstraction. This planner determines the best expected
action using our abstraction. Finally, it converts states and actions from
PPDDL to PSAS+ and vice versa. Additionally we simulated these two
domains with our own PSAS+ simulator to validate the correctness of our
simulator. The other domains were tested with the PSAS+ simulator to
save programming time.
4.3 Results
The results of the six domains are depicted in Figures 4.1-4.6. The x-axis
in the graphs spans the different problem instances within the respective
domain, where a higher instance number corresponds to larger and more
complex problems, while the y-axis corresponds to the reward achieved by
different planners, averaged over 100 runs per problem instance.
1. RFF-BG/PG [34] - The winner of the fully observable probabilistic
track of the 2008 international planning competition
2. Among the planners that participated in the competition, the best-
performing planner for the domain in question
3. The average performance of the planners from the competition that
were capable of dealing with the domain in question.
4. Our abstraction algorithm with minimum-difference value contraction (MSVC), symmetric variable refinement (SVRS), and an abstraction size of 500.
5. MSVC, SVRS and abstraction size of 2000.
6. MSVC, SVRS and abstraction size of 8000.

Figure 4.1: Results for the blocksworld domain. *Average was taken over 4/8 planners.
As mentioned, the Blocksworld domain has one reward at the end. Each instance has a different goal reward and some action costs; that is why many results are negative. Our algorithm did well on the first, smaller instances but could not solve any of the harder ones. In the exploding Blocksworld domain, the reward was always 1 and there was no penalty for actions. But similarly to the Blocksworld domain, there was one reward at the end, which made this domain even harder for our algorithm.
While the Boxworld domain is similar to Blocksworld in that moving
the boxes changes the system state, it is a bit more symmetric and, more
importantly, rewards are gained in intermediate phases, meaning that each
time one box is in its place, we get a reward. Each instance had different
rewards per box, starting from 1 and ending with 100 per box and an addi-
tional 1000 points for all boxes. Our algorithm performed better than the
others on this domain: it was the only one capable of solving the largest
instances, 14 and 15.
Figure 4.2: Results for the exploding blocksworld domain. *Average was taken over 5/8 planners.
The two Tireworld domains are presented together as in the competition.
The first 10 instances span from rewards of 0 to 100 and the last 5 from 0
to 1000. The Triangle Tireworld domain is similar to Blocksworld in that it has one reward at the end. However, because of the close-to-1 discount factor γ, we could propagate the goal reward to other states, and the algorithm chose correctly which states to merge. Thus our results are optimal in most cases.
Rectangle Tireworld is a bad example with which to compare our algo-
rithm because it has only 4 variables, each with huge domains. Thus, if
we can read the input file, we can solve the problem optimally; otherwise,
we cannot. Instance number 15 was too large (having more than 1 million
possible actions) for anyone to solve; the PSAS+ representation of that
problem was 0.5 GB in size. The schedule domain was limited to 100 reward
points when all classes were served, and our algorithm exhibited average
performance on that domain. The search and rescue domain had a poten-
tial of 2000 reward points. This domain was easy, but many planners could not solve it, probably due to advanced PPDDL language constructs. All planners that managed to parse the problem could solve it.
Figure 4.3: Results for the boxworld domain. *Average was taken over 2/8 planners.
Figure 4.4: Results for the joint Tireworld domains as presented and tested in the competition. Instances 1 to 10 are Triangular and 11 to 15 are Rectangular. *Average was taken over 4/8 planners.
Figure 4.5: Results for the schedule domain. *Average was taken over 4/8 planners.
Figure 4.6: Results for the search and rescue domain. *Average was taken over 2/8 planners.
Chapter 5
Summary and Future work
Over the last two decades, numerous works have presented algorithmic ap-
proaches for factored MDPs. While those works on factored MDPs approx-
imate the value function of the original MDP, in this work we explored a
different approach of calculating the exact value function of an approximated
MDP. We exploited and extended a technique known as over-approximating
abstractions to approximately solve the exponential state space MDP prob-
lem.
An abstraction can, in general, be seen as a mapping that reduces the
size of the state space by compacting several states into one. If the abstract
state space is made small enough, the standard solutions for explicit state
space become feasible for it as well. We have adapted the merge-and-shrink
abstraction technique to devise compact controllers for MDPs, by suggest-
ing effective and semantically justifiable strategies for both state contraction
(shrink) and state-space refinement (merge). To cope with the huge number of effects in factored MDPs, we introduced the notion of action abstractions
to extend the merge-and-shrink abstraction technique both for MDPs and
for deterministic planning. This technique allows us to cope more efficiently
with resource limitations by merging or even removing some effects from
the abstract model, sometimes without loss of any valuable information. Fi-
nally, we provided a clear testbed evaluation for our methods and compared
them to other state-of-the-art approaches. The evaluation was carried out
on planning-like MDP domains, which gave our algorithm a natural disad-
vantage because its focus is on the general case of factored MDPs rather
than on planning-like domains. This disadvantage exhibited itself mostly in
the reward structure, as planning-like domains have only a single reward at the goal state. Having said that, our approach appears comparable to state-of-
the-art algorithms.
Future work:
1. To address the natural disadvantage of our algorithm on the planning-
like domains, we need new heuristics that improve the propagation of the goal-state value.
2. Introduce new “safe” ways of reducing the abstraction. Goal-unreachable states are those which cannot lead to other states that gain a reward. An example of a safe way of reducing the abstraction is to combine all goal-unreachable states into one dead-end state.
3. New edge abstraction techniques should be developed to address the largest pitfall of the merge-and-shrink framework for MDPs: the hundreds of millions of transitions. Such techniques would reduce both memory consumption and running time. Example 6 gives a feel for the problem: it shows fixed transitions, as defined in Definition 18. Because the precondition and effect variables of those transitions are already inside the abstraction, the transitions are duplicated over and over again. Note that in this example we only see transition duplication induced by one boolean variable; domains with multi-valued variables and many different effects greatly increase the number of transitions. One way to handle this is to statically analyze fixed actions.
Bibliography
[1] J. Allen. Readings in planning. Morgan Kaufmann Publishers
Inc. San Francisco, CA, USA, 1994.
[2] R. Bellman. Dynamic Programming. Princeton University Press,
Princeton, NJ., 1957.
[3] R. Bellman, R. Kalaba, and B. Kotkin. Polynomial
Approximation–A New Computational Technique in Dynamic
Programming: Allocation Processes, volume 17. American Math-
ematical Society, 1963.
[4] D.P. Bertsekas. Dynamic programming. Prentice-Hall Englewood
Cliffs, NJ, 1987.
[5] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming:
an overview. In Decision and Control, 1995., Proceedings of the
34th IEEE Conference on, volume 1, pages 560–564. IEEE, 1996.
[6] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic programming:
an overview. In AIChE Symposium Series, pages 92–96. Citeseer,
2002.
[7] C. Boutilier, T. Dean, and S. Hanks. Decision theoretic planning:
Structural assumptions and computational leverage. Journal of
Artificial Intelligence Research, 11(1):94, 1999.
[8] C. Boutilier and R. Dearden. Approximating value trees in
structured dynamic programming. In Proceedings of the International Conference on Machine Learning (ICML), pages 54–62, 1996.
[9] C. Boutilier, R. Dearden, and M. Goldszmidt. Exploiting struc-
ture in policy construction. In International Joint Conference on
Artificial Intelligence, volume 14, pages 1104–1113, 1995.
[10] C. Boutilier, R. Dearden, and M. Goldszmidt. Stochastic dy-
namic programming with factored representations. Artificial In-
telligence, 121(1-2):49–107, 2000.
[11] D. Bryce and O. Buffet. International planning competition, uncertainty
part: Benchmarks and results. In IPPC, 2008.
[12] J. Culberson and J. Schaeffer. Pattern databases. Computational
Intelligence, 14(4):318–334, 1998.
[13] T. Dean and R. Givan. Model minimization in Markov decision
processes. In Proceedings of the National Conference on Artificial
Intelligence, pages 106–111. Citeseer, 1997.
[14] S. Edelkamp. Planning with pattern databases. In Proceedings of
the European Conference on Planning (ECP), pages 13–34, 2001.
[15] G.J. Gordon. Stable function approximation in dynamic program-
ming. In Twelfth International Conference on Machine Learning,
1995.
[16] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman. Efficient
solution algorithms for factored MDPs. Journal of Artificial In-
telligence Research, 19(10):399–468, 2003.
[17] M. Helmert, P. Haslum, and J. Hoffmann. Flexible abstraction
heuristics for optimal sequential planning. In Proc. ICAPS, pages
176–183, 2007.
[18] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. Spudd: Stochastic
planning using decision diagrams. In Proceedings of the Fifteenth
Conference on Uncertainty in Artificial Intelligence, pages 279–
288. Citeseer, 1999.
[19] R.A. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
[20] K. Dräger, B. Finkbeiner, and A. Podelski. Directed model checking with distance-preserving abstractions. In Model Checking Software (SPIN 2006), Lecture Notes in Computer Science, pages 19–34. Springer, 2006.
[21] M. Katz and C. Domshlak. Structural patterns heuristics via
fork decomposition. In Proceedings of the 18th International Con-
ference on Automated Planning and Scheduling (ICAPS), pages
182–189, 2008.
[22] M. Katz and C. Domshlak. Structural-pattern databases. In
Proceedings of the 19th International Conference on Automated
Planning and Scheduling (ICAPS), pages 186–193, 2009.
[23] D. Koller and R. Parr. Computing factored value functions for
policies in structured MDPs. In International Joint Conference
on Artificial Intelligence, volume 16, pages 1332–1339. Citeseer,
1999.
[24] D. Koller and R. Parr. Policy iteration for factored MDPs. In In
Proceedings of the Sixteenth Conference on Uncertainty in Artifi-
cial Intelligence UAI-00, pages 326–334, 2000.
[25] M.L. Puterman. Markov decision processes. Wiley-Interscience,
2005.
[26] M.L. Puterman and M.C. Shin. Modified policy iteration algo-
rithms for discounted Markov decision problems. Management
Science, 24(11):1127–1137, 1978.
[27] S.J. Russell and P. Norvig. Artificial intelligence: a modern ap-
proach. Prentice hall, 2009.
[28] D. Schuurmans and R. Patrascu. Direct value-approximation for
factored MDPs. In Proc. NIPS, volume 14, 2001.
[29] P.J. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110(2):568–582, 1985.
[30] R. St-Aubin, J. Hoey, and C. Boutilier. Apricodd: Approximate
policy construction using decision diagrams. Advances in Neural
Information Processing Systems, pages 1089–1096, 2001.
[31] R.S. Sutton. Learning to predict by the methods of temporal
differences. Machine learning, 3(1):9–44, 1988.
[32] P. Tadepalli and D.K. Ok. Scaling up average reward rein-
forcement learning by approximating the domain models and the
value function. In Proceedings of the International Conference on Machine Learning (ICML), pages 471–479, 1996.
[33] J.A. Tatman and R.D. Shachter. Dynamic programming and in-
fluence diagrams. Systems, Man and Cybernetics, IEEE Trans-
actions on, 20(2):365–379, 1990.
[34] F. Teichteil-Konigsbuch, U. Kuter, and G. Infantes. Incremental
plan aggregation for generating policies in MDPs. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), volume 1, pages 1231–1238. In-
ternational Foundation for Autonomous Agents and Multiagent
Systems, 2010.
[35] J.N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference
learning with function approximation. IEEE Transactions on Au-
tomatic Control, 42(5):674–690, 1997.
[36] Q. Yang. Intelligent Planning: A Decomposition and Abstraction
Based Approach (Artificial Intelligence). Springer–Verlag Berlin
Heidelberg, 1997.
[37] H.L.S. Younes and M.L. Littman. PPDDL1.0: The language for the probabilistic part of IPC-4. In Proc. International Planning
Competition, pages 70–73. Citeseer, 2004.
Abstraction methods for devising compact controllers for Markov decision processes

Kolman Vornovitsky
Abstraction methods for devising compact controllers for Markov decision processes

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Kolman Vornovitsky

Submitted to the Senate of the Technion — Israel Institute of Technology
Adar 5770, Haifa, July 2011
The research was carried out under the supervision of Professor Carmel Domshlak of the Faculty of Industrial Engineering and Management.

I thank the Technion for its generous financial support during my studies.
Abstract

The ability to plan behavior is essential for autonomous systems. This ability increases the flexibility and independence of a system by letting it construct a sequence of actions for achieving its goal. Behavior planning has been studied within the field of artificial intelligence for more than three decades [21]. Planning techniques have been applied in many different systems, including robotics, business process planning, web information gathering, autonomous agents, and space mission control. Behavior planning involves a model representing the actions and the problem world, together with various algorithms enabling efficient search in the space of solutions [1, 25].

Deterministic planning focuses on a model in which every action available to the agent has a single possible outcome, known in advance [1, 25]. A solution to such problems is a sequence of actions that can be executed from the initial state to a goal state, where an optimal solution is one that minimizes the number of actions required to bring the agent to the goal state.

A more general planning setting, which includes stochastic planning, is decision-theoretic planning (DTP). The goal of DTP is to produce action schemes (plans or policies) that achieve a high expectation over the agent's objective function, in contrast to deterministic planning, where the goal is to reach a specific goal state with the smallest possible number of actions. Most sequential decision problems can be described semantically using the model of Markov decision processes (MDPs) [2, 4, 14, 19].

The classical approach to solving MDPs is dynamic programming algorithms such as value iteration [2] and policy iteration [14]. These algorithms compute the optimal decision the agent should take in each and every state in order to achieve the highest possible expectation over the objective function. Many other solutions for the optimal construction of policies were developed in the field of operations research, including in particular the modified policy iteration algorithm [20] and asynchronous versions of value iteration and policy iteration.
Common to all these methods is that they require an explicit enumeration of the state space of the MDP.

The use of MDPs as a model for problem solving has highlighted several combinations of techniques for solving decision problems. These techniques came from the artificial intelligence subfields of planning and decision making under uncertainty, and from operations research. One of the interesting observations arising from all these works is the fact that decision problems are typically characterized by structure, and can therefore be solved with dedicated algorithms built to exploit that structure. One of the best-known such characteristics is the representation of the state space by variables, which is commonly used in planning. While the use of variables defines structure in the problem that can be exploited for computational purposes, it also casts doubt on the applicability of standard MDP solution methods: standard solutions generally assume an explicit state space, whereas a variable-based representation of the state space becomes exponential in the number of variables when made explicit.

In the last two decades, several works have addressed the solution of MDP problems with a variable-based state space, called factored MDPs. For factored MDPs, Boutilier [7] uses a variable-based state space and dynamic Bayesian networks (DBNs) to enable a compact representation of the state model. Other works approximated the solution by approximating the objective function with a linear combination of (not necessarily linear) basis functions; this method was used by Bellman et al. 1963 [3], Sutton 1988 [22], and Tsitsiklis et al. 1997 [24]. Guestrin et al. 2003 [12] used basis functions, each of which takes as parameters a small subset of the variables defining the state space. Dean & Givan 1997 [10] proposed a different solution based on model minimization. This method is similar to compressing a state machine into an equivalent minimal machine. Their algorithm receives as input an MDP problem whose structured state space is given in factored form, and attempts to produce from it a reduced explicit model bounded polynomially in the size of the factored model. Naturally, in general this algorithm cannot guarantee that the compression succeeds in reaching the required polynomial size.

Our work presents a different approach to handling an exponential state space. Instead of approximating the objective function of the original MDP problem, as previous works do, we propose to compute an exact optimal objective function for an approximated MDP problem, which is constructed using all the variables that define the state space of the original MDP problem. The state space of the approximated MDP problem is bounded by a predefined parameter, so as to allow standard algorithms to solve the approximated MDP problem optimally.
This idea was inspired by the progress in the area of admissible search heuristics for deterministic planning. The standard approach to solving deterministic problems is search, in one form or another, where heuristics are the most important general method for improving the efficiency of the search. Heuristics are functions that estimate the distance to the goal state in the search space; such functions help guide the search algorithm. One method for constructing admissible heuristics is over-approximating abstractions [9, 11, 16]. In general, an abstraction is a mapping that reduces the state space by contracting groups of states into single states. Having reduced the state space, one can directly apply reasoning to the entire contracted structure, for example with well-known tools such as depth-first search, breadth-first search, or Dijkstra's algorithm. Such reasoning over abstractions enables the construction of heuristics over complete exponential state spaces.

While abstractions in deterministic planning serve mainly, if not exclusively, for creating informative admissible estimates of the distances from a given state to the goal, our goal is to use abstractions in order to compress the state space while preserving, as far as possible, those properties of the original problem that affect its solution the most. The compression is characterized by the resource limits. At one extreme, the abstraction contracts all the states into a single state; at the other extreme, it contracts no state at all. The intermediate cases are, of course, the most interesting ones: we have a small yet non-negligible amount of memory, and the task is to use it in the best possible way.

On the technical side, we generalize the merge-and-shrink abstraction technique of Drager, Finkbeiner & Podelski [15] from the area of verification of distributed automata systems, which was adopted and extended in the context of deterministic planning by Helmert, Haslum & Hoffmann [13]. The computational appeal of this approach rests on composing the different system features (state variables) with abstraction of the intermediate composites. As Helmert [13] shows, this composition enables the creation of very accurate heuristics obtained within relatively compact abstractions. The technique allows great flexibility in choosing abstractions that are not merely projections. This flexibility also carries a drawback: the hard problem of choosing one abstraction out of an enormous number of abstractions becomes even harder.

The contributions of our work are as follows. First, we provide rigorous semantics for the merge and shrink operators on factored MDPs. Second, we propose efficient and semantically justified strategies for state contraction (shrink) and state-space refinement (merge). We analyze the effectiveness of the proposed strategies on various problems.

Sometimes simplifying the state space is not enough. In deterministic planning, the number of possible transitions
is bounded by S · A, where S is the number of states and A is the number of distinct actions. In DTP, each action may have several possible outcomes, and so the number of transitions is even larger. In addition to adopting the state abstraction method, we extended the merge-and-shrink technique, both for MDPs and for deterministic planning, with a technique of action abstraction. This technique allows us to cope more efficiently with resource limits by merging, or even removing, some of the actions from the abstract model, at times without losing any valuable information whatsoever.

Finally, we report an evaluation of our methods and compare them with the other state-of-the-art methods in the field. This empirical study of (efficient) solving of factored MDPs is the main focus of our work. To evaluate our algorithm, we used the infrastructure and rules of the probabilistic track of the 2008 international planning competition [8]. Our algorithm was tested on six different families of problems from the competition against eight competing algorithms. All the domains are MDPs, but some originate in deterministic planning and are therefore restricted to a structure driven by a single goal state. On three of the six problem families, our approach outperformed the best algorithms from the competition.